This is an introduction to Git from a graph theory point of view. In my view, most introductions to Git focus on the actual commands or on Git internals. In my day-to-day work, I realized that I consistently rely on an internal model of the repository as a directed acyclic graph. I also tend to use this point of view when explaining my workflow to other people, with some success. This is definitely not original: many people have said the same thing, to the point that it is a running joke. However, I have not seen a comprehensive introduction to Git from this point of view, without clutter from the Git command line itself.
How to actually use the command line is not the topic of this article; you can refer to the man pages or the excellent Pro Git book (see “Further reading” below). I will reference the relevant Git commands as margin notes.
Concepts: understanding the graph
The basic object in Git is the commit. It consists of three things: a set of parent commits (at least one, except for the initial commit), a diff representing changes (some lines are removed, some are added), and a commit message. It also has a name, so that we can refer to it if needed. (Actually, each commit gets a SHA-1 hash that identifies it uniquely. The hash is computed from the parents, the message, and the diff.)
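The naming scheme described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Git's real object format: the `Commit` class, its fields, and the way the payload is assembled are all hypothetical, and only the idea (the name is a SHA-1 of the parents, the message, and the diff) comes from the text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy commit: parents, a diff, and a message (not Git's real format)."""
    parents: tuple[str, ...]  # names (hashes) of the parent commits
    diff: str                 # textual description of the changes
    message: str

    @property
    def name(self) -> str:
        # Hash everything that defines the commit, parents included.
        payload = "\n".join([*self.parents, self.message, self.diff])
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


root = Commit(parents=(), diff="+hello", message="Initial commit")
child = Commit(parents=(root.name,), diff="+world", message="Add world")
same_change_elsewhere = Commit(parents=(), diff="+world", message="Add world")
```

Because the parents are part of the hashed payload, `child` and `same_change_elsewhere` get different names even though they carry the same diff and message: the name identifies a position in the graph, not just a change.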
A repository is fundamentally just a directed acyclic graph (DAG), where nodes are commits and links are parent-child relationships. (Git can display the graph of a repo, or just a subset of it, on the command line.) A DAG means that two essential properties are verified at all times by the graph:
- it is directed, and the direction always goes from parent to child,
- it is acyclic, otherwise a commit could end up being an ancestor of itself.
As you can see, these make perfect sense in the context of a version-tracking system.
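To make the two properties concrete, here is a minimal sketch using Python's standard `graphlib` module. The commit names and the `parents` mapping are made up for illustration; each commit simply maps to the set of its parents.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical repo: each commit name maps to the set of its parents.
parents = {
    "a": set(),
    "b": {"a"},
    "c": {"b"},
    "d": {"b"},
    "e": {"c", "d"},  # a commit with two parents
}


def history_order(parents):
    """Return the commits with every parent listed before its children."""
    return list(TopologicalSorter(parents).static_order())


def is_acyclic(parents):
    """True if no commit can end up being an ancestor of itself."""
    try:
        history_order(parents)
        return True
    except CycleError:
        return False
```

A graph where `a` and `b` are each other's parent would make `is_acyclic` return `False`: no timeline can place both before the other.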
Here is an example of a repo:
In this representation, each commit points to its children, and they are organized from left to right as in a timeline. The initial commit is the first one, the root of the graph, on the far left. (In the actual implementation, the edges are the other way around: each commit points to its parents. But I feel like it is clearer to visualize the graph ordered with time.)
Note that a commit can have multiple children, and multiple parents (we’ll come back to these specific commits later).
The entirety of Git operations can be understood in terms of manipulations of the graph. In the following sections, we’ll list the different actions we can take to modify the graph.
Naming things: branches and tags
Some commits can be annotated: a named label can be attached to them, referencing that specific commit.
HEAD references the current commit: your current position in the graph. This is just a convenient name for the current commit, much like how . is a shorthand for the current directory when you’re navigating the filesystem. (Move around the graph, i.e. move the HEAD pointer, using git checkout. You can give it commit hashes, branch names, tag names, or relative positions like HEAD~3 for the great-grandparent of the current commit.)
Branches are other labels like this. Each of them has a name and acts as a simple pointer to a commit. Once again, this is simply an alias, in order to have meaningful names when navigating the graph.
In this example, we have three branches, including bugfix. (Do not name your real branches like this! Find a meaningful name describing what changes you are making.) Note that there is nothing special about the names: we can use any name we want, and the master branch is not special in any way.
Tags are another kind of label, once again pointing to a particular commit. The main difference with branches is that branches may move (you can change the commit they point to if you want), whereas tags are fixed forever. (Create branches and tags with the appropriately named git branch and git tag.)
Making changes: creating new commits
When you make some changes in your files, you will then record them in the repo by committing them. (To the surprise of absolutely no one, this is done with git commit.) The action creates a new commit, whose parent will be the current commit. For instance, in the previous case where you were on master, the new repo after committing will be (the new commit is in green):
Two significant things happened here:
- Your position on the graph changed: HEAD points to the new commit you just created.
- More importantly: master moved as well. This is the main property of branches: instead of being “dumb” labels pointing to commits, they will automatically move when you add new commits on top of them. (Note that this won’t be the case with tags, which always point to the same commit no matter what.)
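The two effects listed above can be mimicked with a small toy model. Everything here is hypothetical (the `Repo` class, integer commit names instead of hashes, and `head` storing the name of the current branch); it only illustrates that the current branch follows a new commit while tags stay put.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Repo:
    """Toy repository: commits are small integers instead of hashes."""
    parents: dict = field(default_factory=dict)   # commit -> list of parents
    branches: dict = field(default_factory=dict)  # branch name -> commit
    tags: dict = field(default_factory=dict)      # tag name -> commit
    head: str = "master"                          # name of the current branch
    _ids: count = field(default_factory=count)

    def commit(self):
        new = next(self._ids)
        current = self.branches.get(self.head)
        self.parents[new] = [] if current is None else [current]
        # The current branch follows the new commit; tags are left untouched.
        self.branches[self.head] = new
        return new


repo = Repo()
first = repo.commit()
repo.tags["v1.0"] = first
second = repo.commit()
```

After the second commit, `repo.branches["master"]` is `second`, while `repo.tags["v1.0"]` still points to `first`.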
If you can add commits, you can also remove them (if they don’t have any children, obviously). However, it is often better to add a commit that will revert the changes of another commit (i.e. apply the opposite changes). This way, you keep track of what’s been done to the repository structure, and you do not lose the reverted changes (should you need to re-apply them in the future). (Create a revert commit with git revert, and remove a commit with git reset, which is destructive!)
There is a special type of commit: merge commits, which have more than one parent (for example, the fifth commit from the left in the graph above). As can be expected, the command is git merge.
At this point, we need to talk about conflicts. See Pro Git’s chapter on merging and basic conflict resolution for the details on managing conflicts in practice.
Until now, every action was simple: we can move around, add names, and add some changes. But now we are trying to reconcile two different versions into a single one. These two versions can be incompatible, and in this case the merge commit will have to choose which lines of each version to keep. If, however, there is no conflict, the merge commit will be empty: it will have two parents, but will not contain any changes itself.
Moving commits: rebasing and squashing
Until now, all the actions we’ve seen were append-only. We were only adding stuff, and it would be easy to just remove a node from the graph, and to move the various labels accordingly, to return to the previous state.
Sometimes we want to do more complex manipulation of the graph: moving a commit and all its descendants to another location in the graph. This is called a rebase, which you can perform with git rebase (destructive!).
In this case, we moved the branch feature from its old position (in red) to a new one on top of master (in green).
When I say “move the branch feature”, I actually mean something slightly different than before. Here, we don’t just move the label feature, but also the entire chain of commits starting from the one pointed to by feature up to the common ancestor of feature and its base branch (here master).
In practice, what we have done is delete three commits and add three brand new ones. Git actually helps us here by creating commits with the same changes. Sometimes, it is not possible to apply the same changes exactly because the original version is not the same. For instance, if one of the commits changed a line that no longer exists in the new base, there will be a conflict. When rebasing, you may have to manually resolve these conflicts, similarly to a merge.
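The "same changes, brand new commits" idea can be sketched as follows. This is a simplified model, assuming a linear branch (one parent per commit) and no conflicts; `make_commit`, `rebase`, and the `store` dictionary are hypothetical names, not Git's implementation.

```python
import hashlib


def make_commit(store, parent, diff):
    """Store a toy commit and return its name (a hash of parent and diff)."""
    name = hashlib.sha1(f"{parent}|{diff}".encode()).hexdigest()[:8]
    store[name] = (parent, diff)
    return name


def rebase(store, tip, old_base, new_base):
    """Replay the diffs between old_base (excluded) and tip onto new_base."""
    diffs = []
    while tip != old_base:
        parent, diff = store[tip]
        diffs.append(diff)
        tip = parent
    for diff in reversed(diffs):  # oldest change first
        new_base = make_commit(store, new_base, diff)
    return new_base  # the new tip of the branch


store = {}
root = make_commit(store, None, "init")
master = make_commit(store, root, "fix typo")
f1 = make_commit(store, root, "feature part 1")
feature = make_commit(store, f1, "feature part 2")
new_feature = rebase(store, feature, old_base=root, new_base=master)
```

The replayed commits carry the same diffs but have different names, because their parents changed. In this sketch the old commits stay in `store`; they are simply no longer the ones the moved label points to.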
It is often worthwhile to rebase before merging, because then we can avoid merge commits entirely. Since feature has been rebased on top of master, when merging feature into master, we can just fast-forward master, in effect just moving the master label to where feature is. (You can control whether or not git merge does a fast-forward through its command-line options.)
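A fast-forward is only possible when the target branch is an ancestor of the branch being merged. A minimal sketch of that check, with a made-up graph and hypothetical helper names (`is_ancestor`, `fast_forward`), could look like this:

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` can be reached from `commit` by following parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def fast_forward(parents, branches, target, source):
    """Move label `target` to `source` if no merge commit is needed."""
    if not is_ancestor(parents, branches[target], branches[source]):
        return False  # histories diverged: a real merge is required
    branches[target] = branches[source]
    return True


# Hypothetical graph: feature (f2) was rebased on top of master (m1).
parents = {"root": [], "m1": ["root"], "f1": ["m1"], "f2": ["f1"], "x": ["root"]}
branches = {"master": "m1", "feature": "f2", "other": "x"}
```

Here `fast_forward(parents, branches, "master", "feature")` succeeds and only moves a label, whereas merging `other` would need a merge commit because its history has diverged.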
Another manipulation that we can do on the graph is squashing, i.e. lumping several commits together into a single one. (Git has no dedicated squash command: squashing is usually done through an interactive rebase, git rebase -i, which is destructive!)
Here, the three commits of the feature branch have been condensed into a single one. No conflict can happen, but we lose the history of the changes. Squashing may be useful to clean up a complex history.
Squashing and rebasing, taken together, can be extremely powerful tools to entirely rewrite the history of a repo. With them, you can reorder commits, squash them together, move them elsewhere, and so on. However, these commands are also dangerous: since you overwrite the history, there is a lot of potential for conflicts and general mistakes. By contrast, merges are much safer: even if there are conflicts and you have messed them up, you can always remove the merge commit and go back to the previous state. But when you rebase a set of commits and mess up the conflict resolution, going back is much harder: the original commits are no longer part of the visible graph, and recovering them means digging through Git’s reflog (git reflog) before they are eventually garbage-collected.
Remotes: sharing your work with others
You can use Git as a simple version tracking system for your own projects, on your own computer. But most of the time, Git is used to collaborate with other people. For this reason, Git has an elaborate system for sharing changes with others. The good news is: everything is still represented in the graph! There is nothing fundamentally different to understand.
When two different people work on the same project, each will have a version of the repository locally. Let’s say that Alice and Bob are both working on our project.
Alice has made a significant improvement to the project, and has created several commits, which are tracked in the feature branch she has created locally. The graph above (after rebasing) represents Alice’s repository. Bob, meanwhile, has the same repository but without the feature branch. How can they share their work? Alice can send Bob the commits from feature back to the common ancestor of master and feature. Bob will see this branch as part of a remote graph, which will be superimposed on his graph. (You can add, remove, rename, and generally manage remotes with git remote. To transfer data between you and a remote, you can use, for instance, git pull, which fetches and merges into your local branch automatically.)
The branch name he just got from Alice is prefixed by the name of the remote, in this case alice. These are just ordinary commits, and an ordinary branch (i.e. just a label on a specific commit).
Now Bob can see Alice’s work, and has some idea to improve on it. So he wants to make a new commit on top of Alice’s changes. But the alice/feature branch is here to track the state of Alice’s repository, so he just creates a new branch just for him named feature, where he adds a commit:
Similarly, Alice can now retrieve Bob’s work, and will have a new branch bob/feature with the additional commit. If she wants, she can now incorporate the new commit into her own branch feature, so that her branches feature and bob/feature point to the same commit.
As you can see, sharing work in Git is just a matter of having additional branches that represent the graph of other people. Some branches are shared among different people, and in this case you will have several branches, each prefixed with the name of the remote. Everything is still represented simply in a single graph.
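The exchange between Alice and Bob can be replayed with a toy model in which a repository is a commit graph plus two sets of labels. The `new_repo` and `fetch` helpers and the single-letter commit names are assumptions made for illustration; in particular, this sketch copies every commit, without trying to send only the missing ones.

```python
def new_repo(parents, branches):
    """Toy repository: a commit graph, local branches, and remote branches."""
    return {"parents": dict(parents), "branches": dict(branches), "remotes": {}}


def fetch(local, remote, remote_name):
    """Copy the remote's commits and record its branches under a prefix."""
    local["parents"].update(remote["parents"])  # same name, same commit
    for branch, commit in remote["branches"].items():
        local["remotes"][f"{remote_name}/{branch}"] = commit


alice = new_repo({"a": [], "b": ["a"], "c": ["b"]}, {"master": "a", "feature": "c"})
bob = new_repo({"a": []}, {"master": "a"})

fetch(bob, alice, "alice")  # Bob now sees alice/feature
bob["parents"]["d"] = [bob["remotes"]["alice/feature"]]
bob["branches"]["feature"] = "d"  # Bob's own branch, one commit ahead

fetch(alice, bob, "bob")  # Alice now sees bob/feature
alice["branches"]["feature"] = alice["remotes"]["bob/feature"]
```

At every step there is still a single graph per person; fetching only adds commits and prefixed labels to it.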
Unfortunately, some things are not captured in the graph directly. Most notably, the staging area used for selecting changes for committing, stashing, and submodules greatly extend the capabilities of Git beyond simple graph manipulations. You can read about all of these in Pro Git.
Note: This section is not needed to use Git every day, or even to understand the concepts behind it. However, it can quickly show you that the explanations above are not pure abstractions, and are actually represented directly this way.
Let’s dive a little bit into Git’s internal representations to better understand the concepts. The entire Git repository is contained in a .git folder. Inside it, you will find a simple text file called HEAD, which contains a reference to a location in the graph. For instance, it could contain ref: refs/heads/master. As you can see, HEAD really is just a pointer to somewhere called refs/heads/master. Let’s look into the refs directory to investigate:
$ cat refs/heads/master
f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
This is just a pointer to a specific commit! You can also see that all the other branches are represented the same way. (You must have noticed that our graphs above were slightly misleading: HEAD does not point directly to a commit, but to a branch, which itself points to a commit. If you make HEAD point to a commit directly, this is called a “detached HEAD” state.)
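The two-step lookup (HEAD, then the branch file) is simple enough to reproduce by hand. The function below is a sketch with a hypothetical name, `resolve_head`; it assumes the branch is stored as a plain file under refs, and does not handle references that Git has packed into its packed-refs file.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Follow HEAD to a commit hash, as in the walkthrough above."""
    content = (Path(git_dir) / "HEAD").read_text().strip()
    if content.startswith("ref: "):
        # Usual case: HEAD names a branch, whose file holds the commit hash.
        ref = content[len("ref: "):]
        return (Path(git_dir) / ref).read_text().strip()
    return content  # detached HEAD: the file holds the hash itself
```

Calling `resolve_head(".git")` from the root of a repository should give the hash of the current commit, as long as the current branch has not been packed.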
Remotes and tags are similar: they are stored the same way, elsewhere in the refs directory.
Commits are stored in the objects directory, in subfolders named after the first two characters of their hashes. So the commit above is located at objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd. They are usually in a compressed binary format (for efficiency reasons), and can also be grouped into packfiles. But if you inspect one (with git show), you will see the entire contents (parents, message, diff).
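As a rough illustration of this layout, the sketch below computes the path of an object from its hash and decodes it when it is stored as an individual ("loose") file, which is zlib-compressed and starts with a small header giving its type and size. The function names are hypothetical, and objects that have been moved into packfiles are not handled.

```python
import zlib
from pathlib import Path


def object_path(git_dir, sha):
    """Loose objects live in objects/<first two hex chars>/<the other 38>."""
    return Path(git_dir) / "objects" / sha[:2] / sha[2:]


def read_loose_object(git_dir, sha):
    """Return (type, body) for a loose object; packfiles are not handled."""
    raw = zlib.decompress(object_path(git_dir, sha).read_bytes())
    header, _, body = raw.partition(b"\x00")
    kind, _size = header.split(b" ")
    return kind.decode(), body
```

For a commit, the decoded body is plain text listing, among other things, the parents and the message. The diff displayed by git show is not stored there as such: Git records a snapshot and computes the diff when asked.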
Further reading
To know more about Git, specifically how to use it in practice, I recommend going through the excellent Pro Git book, which covers everything there is to know about the various Git commands and workflows.
The Git man pages (also available via man on your system) have a reputation of being hard to read, but once you have understood the concepts behind repos, commits, branches, and remotes, they provide an invaluable resource to exploit all the power of the command line interface and the various commands and options. (Of course, you could also use the awesome Magit in Emacs, which will greatly facilitate your interactions with Git with the additional benefit of helping you discover Git’s capabilities.)
Finally, if you are interested in the implementation details of Git, you can follow Write yourself a Git and implement Git yourself! (This is surprisingly quite straightforward, and you will end up with a much better understanding of what’s going on.) The chapter on Git in Brown and Wilson (2012) is also excellent.
Brown, Amy, and Greg Wilson. 2012. The Architecture of Open Source Applications, Volume II. Creative Commons. https://www.aosabook.org/en/index.html.