From graphs to Git
This is an introduction to Git from a graph theory point of view. In my view, most introductions to Git focus on the actual commands or on Git internals. In my day-to-day work, I realized that I consistently rely on an internal model of the repository as a directed acyclic graph. I also tend to use this point of view when explaining my workflow to other people, with some success. This is definitely not original: many people have said the same thing, to the point that it is a running joke. However, I have not seen a comprehensive introduction to Git from this point of view, without clutter from the Git command line itself.
How to actually use the command line is not the topic of this article;
you can refer to the man pages or the excellent Pro Git book (see
“Further reading” below). I will reference the relevant Git commands as margin notes.
Concepts: understanding the graph
The basic object in Git is the commit. It is made up of three
things: a set of parent commits (at least one, except for the initial
commit), a diff representing changes (some lines are removed, some are
added), and a commit message. It also has a name, so that we can refer to it if needed.
(Actually, each commit gets a SHA-1 hash that identifies it
uniquely. The hash is computed from the parents, the message, and the diff.)
A repository is fundamentally just a directed acyclic graph
(DAG), where nodes are commits and links are parent-child relationships. (You can visualize the graph of a repo, or just a subset
of it, using git log.) Being a DAG means that the graph satisfies two essential properties at all times:
- it is oriented, and the direction always goes from parent to child,
- it is acyclic, otherwise a commit could end up being an ancestor of itself.
As you can see, these make perfect sense in the context of a version-tracking system.
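To make the model concrete, here is a minimal Python sketch of a commit graph. It is a toy model, not Git's actual implementation: the `Commit` class, the `add` and `ancestors` helpers, and the way the identifier is hashed are all assumptions made for illustration (real Git hashes a differently structured object).

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    parents: tuple  # ids of the parent commits (empty for the initial commit)
    diff: str       # the changes, as plain text in this toy model
    message: str

    @property
    def id(self) -> str:
        # Toy identifier: a SHA-1 over parents, message, and diff.
        payload = "\n".join([*self.parents, self.message, self.diff])
        return hashlib.sha1(payload.encode()).hexdigest()


def add(graph: dict, commit: Commit) -> str:
    # A commit can only name parents that already exist, and its id depends
    # on theirs, so it can never become its own ancestor.
    assert all(p in graph for p in commit.parents)
    graph[commit.id] = commit
    return commit.id


def ancestors(graph: dict, commit_id: str) -> set:
    """Return the ids of all commits reachable through parent links."""
    seen, stack = set(), list(graph[commit_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current].parents)
    return seen


graph = {}
root = add(graph, Commit((), "+hello", "Initial commit"))
second = add(graph, Commit((root,), "+world", "Add world"))
third = add(graph, Commit((second,), "-hello", "Remove hello"))

print(ancestors(graph, third) == {root, second})  # True
print(third in ancestors(graph, third))           # False
```

Because a commit's identifier depends on its parents' identifiers, the graph can only grow forward, which is why the acyclic property comes for free in this sketch.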
Here is an example of a repo:
In this representation, each commit points to its
children, and they were organized from left to right as in a timeline. (In the actual implementation, the edges are the
other way around: each commit points to its parents. But I feel like
it is clearer to visualize the graph ordered with time.) The initial commit is the first one, the root of the graph, on the far left.
Note that a commit can have multiple children, and multiple parents (we’ll come back to these specific commits later).
The entirety of Git operations can be understood in terms of manipulations of the graph. In the following sections, we’ll list the different actions we can take to modify the graph.
Naming things: branches and tags
Some commits can be annotated: they can have a named label attached to them, which references that specific commit.
HEAD references the current commit: your current
position in the graph. This is just a convenient name for the current commit, much like how
. is a shorthand for the
current directory when you’re navigating the filesystem.
(Move around the graph, i.e. move HEAD, with
git checkout. You can give it commit hashes, branch
names, tag names, or relative positions like
HEAD~3 for the
great-grandparent of the current commit.)
Branches are other labels like this. Each of them has a name and acts as a simple pointer to a commit. Once again, this is simply an alias, in order to have meaningful names when navigating the graph.
In this example, we have three branches: master, feature, and
bugfix. (Do not name your real branches like this! Find a
meaningful name describing what changes you are making.)
Note that there is nothing special about the names: we can use any name we want, and the
master branch is not special in any way.
Tags are another kind of label, once again pointing to a particular commit. The main difference with branches is that branches may move (you can change the commit they point to if you want), whereas tags are fixed forever.
(Create branches and tags with the appropriately named
git branch and git tag.)
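The labels described above can be sketched as plain dictionaries. This is a toy model under stated assumptions: the single-letter commit ids, the `resolve` and `relative` helpers, and the tag name `v1.0` are hypothetical, and `relative` only imitates the first-parent behaviour of the `~n` notation.

```python
# Toy graph: each commit id maps to the list of its parents' ids.
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["c"],
    "e": ["b"],
}

# Branches and tags are plain name -> commit id mappings.
branches = {"master": "d", "feature": "e"}
tags = {"v1.0": "b"}
head = "master"  # HEAD names the branch we are currently on


def resolve(name: str) -> str:
    """Turn HEAD, a branch name, a tag name, or a commit id into a commit id."""
    if name == "HEAD":
        return resolve(head)
    if name in branches:
        return branches[name]
    if name in tags:
        return tags[name]
    if name in parents:
        return name
    raise KeyError(f"unknown revision: {name}")


def relative(name: str, n: int) -> str:
    """Follow the first parent n times, like the name~n notation."""
    commit = resolve(name)
    for _ in range(n):
        commit = parents[commit][0]
    return commit


print(resolve("HEAD"))      # d
print(relative("HEAD", 3))  # a
print(resolve("v1.0"))      # b
```

Every name ends up as a commit id, which is the sense in which labels are only aliases for nodes of the graph.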
Making changes: creating new commits
When you make some changes in your files, you will then record them in
the repo by committing them. (To the surprise of absolutely no one, this is done
with git commit.) The action creates a new commit, whose parent will be the current commit. For instance, in the previous case where you were on
master, the new repo after
committing will be (the new commit is in green):
Two significant things happened here:
- Your position on the graph changed: HEAD points to the new commit you just created.
- More importantly: master moved as well. This is the main property of branches: instead of being “dumb” labels pointing to commits, they will automatically move when you add new commits on top of them. (Note that this won’t be the case with tags, which always point to the same commit no matter what.)
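The two effects listed above can be sketched in a few lines of Python. This is only an illustration: the `commit` function, the `c0`, `c1` ids, and the `v1.0` tag are hypothetical names, and the diff and message are left out to focus on how the labels move.

```python
import itertools

_counter = itertools.count(1)

parents = {"c0": []}         # commit id -> parent ids
branches = {"master": "c0"}  # labels that move
tags = {"v1.0": "c0"}        # labels that stay put
head = "master"              # HEAD follows a branch


def commit() -> str:
    """Create a commit on top of the current one and advance the branch."""
    new_id = f"c{next(_counter)}"
    parents[new_id] = [branches[head]]  # parent is the current commit
    branches[head] = new_id             # the current branch moves forward
    return new_id


new = commit()
print(parents[new])        # ['c0']
print(branches["master"])  # c1
print(tags["v1.0"])        # c0
```

Since HEAD names the branch rather than a commit in this sketch, moving the branch moves HEAD along with it.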
If you can add commits, you can also remove them (if they don’t have
any children, obviously). However, it is often better to add a commit
that will revert the changes of another commit (i.e. apply the opposite changes). This way, you keep track of what’s been done to the repository structure, and you do not lose the reverted changes (should you need to re-apply them in the future).
(Create a revert commit with
git revert, and remove a commit with
git reset (destructive!).)
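"Applying the opposite changes" can be illustrated with a deliberately simplified diff format. In this sketch, a file is an unordered set of lines and a diff is a list of `(op, line)` pairs; both the representation and the `apply` and `invert` helpers are assumptions for illustration, not how Git stores or reverts changes.

```python
def apply(lines: frozenset, diff: list) -> frozenset:
    """Apply a diff, given as (op, line) pairs, to an unordered set of lines."""
    result = set(lines)
    for op, line in diff:
        if op == "+":
            result.add(line)
        else:
            result.discard(line)
    return frozenset(result)


def invert(diff: list) -> list:
    """Build the opposite diff: additions become removals and vice versa."""
    swap = {"+": "-", "-": "+"}
    return [(swap[op], line) for op, line in reversed(diff)]


before = frozenset({"import os", "print('debug')"})
change = [("-", "print('debug')"), ("+", "import sys")]

after = apply(before, change)
reverted = apply(after, invert(change))

print(sorted(after))       # ['import os', 'import sys']
print(reverted == before)  # True
```

The revert is itself just another diff, so it can be recorded as an ordinary new commit while the original commit stays in the graph.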
There is a special type of commit: merge commits, which have more
than one parent (for example, the fifth commit from the left in the
graph above). As can be expected, the command is git merge.
At this point, we need to talk about conflicts. See Pro Git’s chapter on merging and basic conflict resolution for the details on managing conflicts in practice.
Until now, every action was simple: we can move around, add names, and add some changes. But now we are trying to reconcile two different versions into a single one. These two versions can be incompatible, and in this case the merge commit will have to choose which lines of each version to keep. If, however, there is no conflict, the merge commit will be empty: it will have two parents, but will not contain any changes itself.
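Structurally, a merge is a small operation on the graph. The sketch below uses hypothetical single-letter commit ids and toy `ancestors` and `merge` helpers; it only models the shape of the graph and ignores file contents, so it says nothing about how conflicts are detected or resolved.

```python
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],  # tip of master
    "d": ["b"],
    "e": ["d"],  # tip of feature
}
branches = {"master": "c", "feature": "e"}


def ancestors(commit: str) -> set:
    """All commits reachable from `commit`, itself included."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge(target: str, source: str, new_id: str) -> str:
    """Create a merge commit with two parents and advance `target` to it."""
    parents[new_id] = [branches[target], branches[source]]
    branches[target] = new_id
    return new_id


# The two versions share their history up to commit "b".
common = ancestors(branches["master"]) & ancestors(branches["feature"])
print(sorted(common))  # ['a', 'b']

m = merge("master", "feature", "m")
print(parents[m])            # ['c', 'e']
print(sorted(ancestors(m)))  # ['a', 'b', 'c', 'd', 'e', 'm']
print(branches["feature"])   # e
```

After the merge, the history of both branches is reachable from master, while the feature label has not moved.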
Moving commits: rebasing and squashing
Until now, all the actions we’ve seen were append-only. We were only adding stuff, and it would be easy to just remove a node from the graph, and to move the various labels accordingly, to return to the previous state.
Sometimes we want to do more complex manipulation of the graph: moving
a commit and all its descendants to another location in the
graph. This is called a rebase. (You can perform it with
git rebase (destructive!).)
In this case, we moved the branch
feature from its old position (in
red) to a new one on top of
master (in green).
When I say “move the branch
feature”, I actually mean something
slightly different than before. Here, we don’t just move the label
feature, but also the entire chain of commits starting from the one
pointed to by feature up to the common ancestor of
feature and its
base branch (here master).
In practice, what we have done is delete three commits and add three brand new commits. Git actually helps us here by creating commits with the same changes. Sometimes, it is not possible to apply the same changes exactly because the original version is not the same. For instance, if one of the commits changed a line that no longer exists in the new base, there will be a conflict. When rebasing, you may have to manually resolve these conflicts, similarly to a merge.
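The "delete three commits, add three new ones" view of a rebase can be sketched as follows. This is a toy model with assumptions: commits have a single parent, ids are derived from the parent and the diff, and the `add_commit`, `chain`, and `rebase` helpers are hypothetical. It replays diffs blindly and does not model conflicts at all.

```python
import hashlib

commits = {}  # commit id -> (parent id or None, diff)


def add_commit(parent, diff):
    # Toy id: derived from the parent and the diff, so the same change on a
    # different parent gets a different id.
    new_id = hashlib.sha1(f"{parent}:{diff}".encode()).hexdigest()
    commits[new_id] = (parent, diff)
    return new_id


def chain(tip, base):
    """Commits from just after `base` up to `tip`, oldest first (linear history)."""
    result = []
    while tip != base:
        result.append(tip)
        tip = commits[tip][0]
    return result[::-1]


def rebase(tip, old_base, new_base):
    """Replay the diffs of old_base..tip on top of new_base; return the new tip."""
    current = new_base
    for old in chain(tip, old_base):
        current = add_commit(current, commits[old][1])
    return current


root = add_commit(None, "init")
master = add_commit(add_commit(root, "m1"), "m2")
feature = add_commit(add_commit(add_commit(root, "f1"), "f2"), "f3")

new_feature = rebase(feature, old_base=root, new_base=master)
old_diffs = [commits[c][1] for c in chain(feature, root)]
new_diffs = [commits[c][1] for c in chain(new_feature, master)]

print(old_diffs == new_diffs)  # True
print(new_feature != feature)  # True
```

The rebased commits carry the same changes but are different nodes of the graph, because their parents (and therefore their ids) have changed.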
It is often interesting to rebase before merging, because then we can
avoid merge commits entirely. Since
feature has been rebased on top of
master, when merging feature into
master, we can just fast-forward
master, in effect just moving the master label to where
feature is. (You can control whether or not
git merge does a
fast-forward with the --ff, --no-ff, and --ff-only options.)
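A fast-forward can be sketched as a simple ancestry check followed by a label move. The commit ids and the `is_ancestor` and `merge_fast_forward` helpers below are hypothetical, and the sketch models only the graph, not file contents.

```python
parents = {
    "a": [],
    "b": ["a"],  # tip of master
    "c": ["b"],
    "d": ["c"],  # tip of feature, already rebased on top of master
}
branches = {"master": "b", "feature": "d"}


def is_ancestor(older: str, newer: str) -> bool:
    """True if `older` is reachable from `newer` through parent links."""
    stack, seen = [newer], set()
    while stack:
        current = stack.pop()
        if current == older:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def merge_fast_forward(target: str, source: str) -> bool:
    """Move `target` to `source` if no merge commit is needed."""
    if not is_ancestor(branches[target], branches[source]):
        return False  # histories diverged: a real merge commit is required
    branches[target] = branches[source]
    return True


print(merge_fast_forward("master", "feature"))  # True
print(branches["master"])                       # d
print(len(parents))                             # 4
```

No commit was created: the graph has the same four nodes as before, and only the master label has moved.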
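Another manipulation that we can do on the graph is squashing,
i.e. lumping several commits together in a single one.
(There is no dedicated git squash command: squashing is usually done with an interactive rebase, git rebase -i, or with git merge --squash. Destructive!)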
Here, the three commits of the
feature branch have been condensed
into a single one. No conflict can happen, but we lose the history of
the changes. Squashing may be useful to clean up a complex history.
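Squashing can be sketched as collapsing a chain of commits into one node that carries all their changes. The commit ids, the diff strings, and the `squash` helper are hypothetical, and simply concatenating the diffs is a simplification: it keeps every individual change instead of computing a single net diff.

```python
commits = {
    # commit id -> (parent id, list of changes)
    "base": (None, []),
    "f1": ("base", ["+parse input"]),
    "f2": ("f1", ["+validate input"]),
    "f3": ("f2", ["-debug print"]),
}


def squash(tip: str, base: str, new_id: str) -> str:
    """Add one commit on top of `base` carrying all the diffs of base..tip."""
    diffs = []
    current = tip
    while current != base:
        parent, diff = commits[current]
        diffs = diff + diffs  # prepend, so older changes come first
        current = parent
    commits[new_id] = (base, diffs)
    return new_id


s = squash("f3", "base", "squashed")
print(commits[s])
# ('base', ['+parse input', '+validate input', '-debug print'])
```

The new commit sits directly on top of the base, so the intermediate steps f1, f2, and f3 are no longer part of the branch's history.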
Squashing and rebasing, taken together, can be extremely powerful tools to entirely rewrite the history of a repo. With them, you can reorder commits, squash them together, move them elsewhere, and so on. However, these commands are also extremely dangerous: since you overwrite the history, there is a lot of potential for conflicts and general mistakes. By contrast, merges are completely safe: even if there are conflicts and you have messed them up, you can always remove the merge commit and go back to the previous state. But when you rebase a set of commits and mess up the conflict resolution, there is no easy way back: the original commits are no longer reachable from any branch, and recovering the original state means digging through Git’s reflog before those commits are garbage-collected.
Remotes: sharing your work with others
You can use Git as a simple version tracking system for your own projects, on your own computer. But most of the time, Git is used to collaborate with other people. For this reason, Git has an elaborate system for sharing changes with others. The good news is: everything is still represented in the graph! There is nothing fundamentally different to understand.
When two different people work on the same project, each will have a version of the repository locally. Let’s say that Alice and Bob are both working on our project.
Alice has made a significant improvement to the project, and has
created several commits, which are tracked in the
feature branch she
has created locally. The graph above (after rebasing) represents
Alice’s repository. Bob, meanwhile, has the same repository but without the
feature branch. How can they share their work? Alice can
send the commits from
feature to the common ancestor of master and
feature to Bob. Bob will see this branch as part of a remote
graph, which will be superimposed on his graph. (You can add, remove, rename, and generally manage remotes with
git remote. To transfer data between you and a remote, use git fetch,
git pull (which fetches and merges into your local
branch automatically), and git push.)
The branch name he just got from Alice is prefixed by the name of the
remote, in this case
alice. These are just ordinary commits, and an
ordinary branch (i.e. just a label on a specific commit).
Now Bob can see Alice’s work, and has some idea to improve on it. So
he wants to make a new commit on top of Alice’s changes. But the
alice/feature branch is here to track the state of Alice’s
repository, so he just creates a new branch just for him named
feature, where he adds a commit:
Similarly, Alice can now retrieve Bob’s work, and will have a new branch
bob/feature with the additional commit. If she wants, she can
now incorporate the new commit into her own branch
feature, making her feature branch point to the same commit as bob/feature.
As you can see, sharing work in Git is just a matter of having additional branches that represent the graph of other people. Some branches are shared among different people, and in this case you will have several branches, each prefixed with the name of the remote. Everything is still represented simply in a single graph.
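The exchange between Alice and Bob can be sketched with two toy repositories. Everything here is an assumption for illustration: repositories are plain dictionaries, the `fetch` helper copies all of the remote's commits rather than only the missing ones, and the commit ids are hypothetical.

```python
def fetch(local: dict, remote_name: str, remote: dict) -> None:
    """Copy the remote's commits and expose its branches as remote_name/branch."""
    local["commits"].update(remote["commits"])  # commit ids are shared
    for branch, commit in remote["branches"].items():
        local["remote_branches"][f"{remote_name}/{branch}"] = commit


shared = {"a": [], "b": ["a"]}  # commit id -> parent ids
alice = {
    "commits": {**shared, "f1": ["b"], "f2": ["f1"]},
    "branches": {"master": "b", "feature": "f2"},
    "remote_branches": {},
}
bob = {
    "commits": dict(shared),
    "branches": {"master": "b"},
    "remote_branches": {},
}

fetch(bob, "alice", alice)
print(bob["remote_branches"]["alice/feature"])  # f2

# Bob starts his own branch from Alice's work and adds a commit.
bob["branches"]["feature"] = bob["remote_branches"]["alice/feature"]
bob["commits"]["f3"] = [bob["branches"]["feature"]]
bob["branches"]["feature"] = "f3"

fetch(alice, "bob", bob)
print(alice["remote_branches"]["bob/feature"])  # f3
print(alice["branches"]["feature"])             # f2
```

Fetching never touches Alice's own feature label: she still has to merge (here, fast-forward) to incorporate Bob's commit into her branch.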
Unfortunately, some things are not captured in the graph directly. Most notably, the staging area used for selecting changes for committing, stashing, and submodules greatly extend the capabilities of Git beyond simple graph manipulations. You can read about all of these in Pro Git.
Internals
Note: This section is not needed to use Git every day, or even to understand the concepts behind it. However, it can quickly show you that the explanations above are not pure abstractions, and are actually represented directly this way.
Let’s dive a little bit into Git’s internal representations to better
understand the concepts. The entire Git repository is contained in a
.git folder. Inside the
.git folder, you will find a simple text file called
HEAD, which contains a reference to a location in the graph. For
instance, it could contain
ref: refs/heads/master. As you can see,
HEAD really is just a pointer to somewhere called
refs/heads/master. Let’s look into the
refs directory to see what is there:

    $ cat refs/heads/master
    f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
This is just a pointer to a specific commit! You can also see that all
the other branches are represented the same way. (You must have
noticed that our graphs above were slightly misleading: HEAD does
not point directly to a commit, but to a branch, which itself points
to a commit. If you make
HEAD point to a commit directly, this is
called a “detached HEAD” state.)
Remotes and tags are similar: they are in refs/remotes and refs/tags.
Commits are stored in the
objects directory, in subfolders named
after the first two characters of their hashes. So the commit above is
located at objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd. Commits
are usually in a binary format (for efficiency reasons) called
packfiles. But if you inspect one (with
git show), you will see the
entire contents (parents, message, diff).
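The lookup described in this section can be sketched in Python against a mock directory. The sketch builds a fake `.git` layout in a temporary folder instead of reading a real repository, and the `resolve_head` and `loose_object_path` helpers are hypothetical names. It assumes the branch exists as a plain file under `refs`; a real repository may instead keep it in a `packed-refs` file, which this sketch ignores.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit hash, handling the detached case."""
    content = (git_dir / "HEAD").read_text().strip()
    if content.startswith("ref: "):
        # HEAD points to a branch, which itself points to a commit.
        return (git_dir / content[len("ref: "):]).read_text().strip()
    return content  # detached HEAD: the file holds a hash directly


def loose_object_path(git_dir: Path, commit_hash: str) -> Path:
    """Location of an unpacked object: objects/<first two chars>/<rest>."""
    return git_dir / "objects" / commit_hash[:2] / commit_hash[2:]


with tempfile.TemporaryDirectory() as tmp:
    # Mock layout mirroring the example in the text.
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    (git_dir / "refs" / "heads" / "master").write_text(
        "f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd\n"
    )
    commit_hash = resolve_head(git_dir)
    relative = loose_object_path(git_dir, commit_hash).relative_to(git_dir)

print(commit_hash)          # f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
print(relative.as_posix())  # objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
```

Two file reads are enough to go from HEAD to a commit hash, which is the "pointer to a pointer" structure described above.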
Further reading
To know more about Git, specifically how to use it in practice, I recommend going through the excellent Pro Git book, which covers everything there is to know about the various Git commands and workflows.
The Git man pages (also available via
man on your system) have a
reputation of being hard to read, but once you have understood the
concepts behind repos, commits, branches, and remotes, they provide an
invaluable resource to exploit all the power of the command line
interface and the various commands and options. (Of course,
you could also use the awesome Magit in Emacs, which will greatly
facilitate your interactions with Git with the additional benefit of
helping you discover Git’s capabilities.)
Finally, if you are interested in the implementation details of Git, you can follow Write yourself a Git and implement Git yourself! (This is surprisingly quite straightforward, and you will end up with a much better understanding of what’s going on.) The chapter on Git in Brown and Wilson (2012) is also excellent.