From graphs to Git

March 8, 2021

Introduction
Concepts: understanding the graph
Remotes: sharing your work with others
Additional concepts
Internals
Further reading
References

Introduction

This is an introduction to Git from a graph theory point of view. In my view, most introductions to Git focus on the actual commands or on Git internals. In my day-to-day work, I realized that I consistently rely on an internal model of the repository as a directed acyclic graph. I also tend to use this point of view when explaining my workflow to other people, with some success. This is definitely not original: many people have said the same thing, to the point that it is a running joke. However, I have not seen a comprehensive introduction to Git from this point of view, without clutter from the Git command line itself.

How to actually use the command line is not the topic of this article; you can refer to the man pages or the excellent Pro Git book (see “Further reading” below). I will reference the relevant Git commands as margin notes.

Concepts: understanding the graph

Repository

The basic object in Git is the commit. It consists of three things: a set of parent commits (at least one, except for the initial commit), a diff representing changes (some lines are removed, some are added), and a commit message. It also has a name, so that we can refer to it if needed. (Actually, each commit gets a SHA-1 hash that identifies it uniquely. The hash is computed from the commit's contents, including the parents, the message, and the changes.)
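To make the idea of a commit and its name concrete, here is a minimal sketch in Python. The Commit class and its toy serialization are assumptions made for illustration: real Git hashes a different, more detailed object format, but the principle (the name is derived from the parents, the message, and the changes) is the same.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    parents: tuple  # names (hashes) of the parent commits; empty for the initial commit
    diff: str       # the changes, as plain text
    message: str

    @property
    def name(self) -> str:
        # Toy serialization, NOT Git's real object format.
        payload = "\n".join([*self.parents, self.message, self.diff])
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


root = Commit(parents=(), diff="+hello", message="Initial commit")
child = Commit(parents=(root.name,), diff="+world", message="Add world")
```

Because the parents are part of what gets hashed, the same changes attached to a different parent produce a commit with a different name.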

A repository is fundamentally just a directed acyclic graph (DAG), where nodes are commits and links are parent-child relationships. (You can visualize the graph of a repo, or just a subset of it, using git log --graph.) A DAG means that two essential properties hold at all times:

it is directed, and the direction always goes from parent to child,
it is acyclic, otherwise a commit could end up being an ancestor of itself.

As you can see, these make perfect sense in the context of a version-tracking system.
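The two properties can be checked on a toy repository. In this sketch, the parents dictionary and the single-letter commit names are hypothetical; the graph is stored the way Git stores it, with each commit listing its parents.

```python
# Hypothetical toy repository: each commit name maps to the names of its parents.
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
    "e": ["c", "d"],  # a commit with two parents
}


def ancestors(commit):
    """Return every commit reachable from `commit` by following parent links."""
    seen = set()
    stack = list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_acyclic():
    """The graph is acyclic when no commit is one of its own ancestors."""
    return all(commit not in ancestors(commit) for commit in parents)
```

Here ancestors("e") contains every other commit, and is_acyclic() holds because no commit can be reached from itself.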

Here is an example of a repo:

In this representation, each commit points to its children, and they were organized from left to right as in a timeline. The initial commit is the first one, the root of the graph, on the far left. (In the actual implementation, the edges are the other way around: each commit points to its parents. But I feel like it is clearer to visualize the graph ordered with time.)

Note that a commit can have multiple children, and multiple parents (we’ll come back to these specific commits later).

The entirety of Git operations can be understood in terms of manipulations of the graph. In the following sections, we’ll list the different actions we can take to modify the graph.

Naming things: branches and tags

Some commits can be annotated: they can have a named label attached to them.

For instance, HEAD references the current commit: your current position in the graph. This is just a convenient name for the current commit, much like how . is a shorthand for the current directory when you’re navigating the filesystem. (Move around the graph, i.e. move the HEAD pointer, using git checkout. You can give it commit hashes, branch names, tag names, or relative positions like HEAD~3 for the great-grandparent of the current commit.)
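Relative positions such as HEAD~3 are just a walk along parent links. A minimal sketch, assuming a hypothetical linear history stored as a dictionary of parents, and following the first parent at each step:

```python
# Hypothetical linear history: a <- b <- c <- d, with HEAD on d.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
head = "d"


def nth_ancestor(commit, n):
    """Follow the first parent n times, like HEAD~n does."""
    for _ in range(n):
        commit = parents[commit][0]
    return commit


great_grandparent = nth_ancestor(head, 3)
```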

Branches are other labels like this. Each of them has a name and acts as a simple pointer to a commit. Once again, this is simply an alias, in order to have meaningful names when navigating the graph.

In this example, we have three branches: master, feature, and bugfix. (Do not name your real branches like this! Find a meaningful name describing what changes you are making.) Note that there is nothing special about the names: we can use any name we want, and the master branch is not special in any way.

Tags are another kind of label, once again pointing to a particular commit. The main difference with branches is that branches may move (you can change the commit they point to if you want), whereas tags are fixed forever. (Create branches and tags with the appropriately named git branch and git tag.)
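Since labels are nothing more than names attached to commits, they can be sketched as plain dictionaries. The commit names (c1 to c5) and the tag name below are hypothetical.

```python
# Hypothetical commit names; labels are plain name -> commit mappings.
branches = {"master": "c3", "feature": "c5", "bugfix": "c4"}
tags = {"v1.0": "c2"}


def resolve(name):
    """Turn a branch name, a tag name, or a commit name into a commit name."""
    if name in branches:
        return branches[name]
    if name in tags:
        return tags[name]
    return name  # assume it already is a commit name


# Moving a branch is just rebinding its label to another commit.
branches["bugfix"] = "c5"
```

Nothing in the graph itself changes when a label moves: only the mapping does.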

Making changes: creating new commits

When you make some changes in your files, you then record them in the repo by committing them. (To the surprise of absolutely no one, this is done with git commit.) The action creates a new commit, whose parent will be the current commit. For instance, in the previous case where you were on master, the new repo after committing will be (the new commit is in green):

Two significant things happened here:

Your position on the graph changed: HEAD points to the new commit you just created.
More importantly: master moved as well. This is the main property of branches: instead of being “dumb” labels pointing to commits, they will automatically move when you add new commits on top of them. (Note that this won’t be the case with tags, which always point to the same commit no matter what.)
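These two effects can be sketched with a small model. The Repo class below is hypothetical; it stores HEAD as the name of the current branch, which is one simple way to make the position and the branch move together, while tags stay put.

```python
class Repo:
    def __init__(self):
        self.parents = {"c1": []}         # commit name -> parent names
        self.branches = {"master": "c1"}  # movable labels
        self.tags = {"v1.0": "c1"}        # fixed labels
        self.head = "master"              # HEAD follows the current branch

    def current_commit(self):
        return self.branches[self.head]

    def commit(self, name):
        # The new commit's parent is the current commit...
        self.parents[name] = [self.current_commit()]
        # ...and the current branch (hence HEAD) moves onto it.
        self.branches[self.head] = name
        return name


repo = Repo()
repo.commit("c2")
```

After the call, master and HEAD both designate c2, while the v1.0 tag still points to c1.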

If you can add commits, you can also remove them (if they don’t have any children, obviously). However, it is often better to add a commit that will revert the changes of another commit (i.e. apply the opposite changes). This way, you keep track of what’s been done to the repository structure, and you do not lose the reverted changes (should you need to re-apply them in the future). (Create a revert commit with git revert, and remove a commit with git reset, which is destructive!)
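"Applying the opposite changes" can be illustrated with a toy diff format. This is a sketch under strong assumptions: a diff is a list of (sign, line) pairs and a file is an unordered set of lines, which is far simpler than what Git really does.

```python
def invert(diff):
    """Swap added and removed lines to get the opposite change."""
    flip = {"+": "-", "-": "+"}
    return [(flip[sign], line) for sign, line in diff]


def apply(lines, diff):
    """Apply a toy diff to a set of lines (order and duplicates are ignored)."""
    result = set(lines)
    for sign, line in diff:
        if sign == "+":
            result.add(line)
        else:
            result.discard(line)
    return result


original = {"a", "b"}
diff = [("-", "b"), ("+", "c")]
changed = apply(original, diff)
reverted = apply(changed, invert(diff))
```

The revert is a new change on top of the old one, so both remain in the history.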

Merging

There is a special type of commit: merge commits, which have more than one parent (for example, the fifth commit from the left in the graph above). (As can be expected, the command is git merge.)

At this point, we need to talk about conflicts. (See Pro Git’s chapter on merging and basic conflict resolution for the details on managing conflicts in practice.) Until now, every action was simple: we can move around, add names, and add some changes. But now we are trying to reconcile two different versions into a single one. These two versions can be incompatible, and in this case the merge commit will have to choose which lines of each version to keep. If, however, there is no conflict, the merge commit will be empty: it will have two parents, but will not contain any changes itself.
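Reconciling two versions is usually done by comparing both of them with their common ancestor (a three-way merge). The sketch below is a simplification: it works on whole files stored in hypothetical {filename: content} dictionaries, whereas Git reconciles changes line by line within each file.

```python
def three_way_merge(base, ours, theirs):
    """Merge two versions of {filename: content} that share a common base."""
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            result = o              # both sides agree
        elif o == b:
            result = t              # only their side changed it
        elif t == b:
            result = o              # only our side changed it
        else:
            conflicts.append(name)  # both sides changed it differently
            continue
        if result is not None:      # None means the file was deleted
            merged[name] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1"}
clean = three_way_merge(base, {"a.txt": "2", "b.txt": "1"}, {"a.txt": "1", "b.txt": "3"})
clash = three_way_merge(base, {"a.txt": "2", "b.txt": "1"}, {"a.txt": "3", "b.txt": "1"})
```

In the first case each side touched a different file, so nothing has to be chosen by hand. In the second case both sides changed a.txt differently, and a human has to decide.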

Moving commits: rebasing and squashing

Until now, all the actions we’ve seen were append-only. We were only adding stuff, and it would be easy to just remove a node from the graph, and to move the various labels accordingly, to return to the previous state.

Sometimes we want to do more complex manipulation of the graph: moving a commit and all its descendants to another location in the graph. This is called a rebase. (You can perform it with git rebase, which is destructive!)

In this case, we moved the branch feature from its old position (in red) to a new one on top of master (in green).

When I say “move the branch feature”, I actually mean something slightly different than before. Here, we don’t just move the label feature, but also the entire chain of commits from the one pointed to by feature back to (but not including) the common ancestor of feature and its base branch (here master).

In practice, what we have done is delete three commits and add three brand new commits. Git actually helps us here by creating commits with the same changes. Sometimes, it is not possible to apply the same changes exactly because the original version is not the same. For instance, if one of the commits changed a line that no longer exists in the new base, there will be a conflict. When rebasing, you may have to manually resolve these conflicts, similarly to a merge.
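The "delete and recreate" nature of a rebase can be sketched as replaying changes on a new base. The repository below is hypothetical, only handles single-parent history, and ignores conflicts entirely.

```python
# Hypothetical repository: commit name -> (parent name, change description).
commits = {
    "a": (None, "init"),
    "b": ("a", "work on master"),
    "f1": ("a", "feature part 1"),
    "f2": ("f1", "feature part 2"),
}
branches = {"master": "b", "feature": "f2"}


def chain(commit):
    """List a commit and its ancestors, newest first (single-parent history only)."""
    result = []
    while commit is not None:
        result.append(commit)
        commit = commits[commit][0]
    return result


def rebase(branch, onto):
    base_history = set(chain(branches[onto]))
    # Commits that are on `branch` but not on `onto`, oldest first.
    to_replay = [c for c in reversed(chain(branches[branch])) if c not in base_history]
    tip = branches[onto]
    for old in to_replay:
        new = old + "'"                        # a brand new commit...
        commits[new] = (tip, commits[old][1])  # ...carrying the same change
        tip = new
    branches[branch] = tip                     # finally, move the label


rebase("feature", "master")
```

The old commits f1 and f2 are not modified: they are simply no longer reachable from any branch.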

It is often interesting to rebase before merging, because then we can avoid merge commits entirely. Since feature has been rebased on top of master, when merging feature into master, we can just fast-forward master, in effect just moving the master label to where feature is. (You can control whether or not git merge does a fast-forward with the --ff-only and --no-ff flags.)
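A fast-forward is only possible when the branch being moved is an ancestor of the branch it is moved to. A minimal sketch, on a hypothetical repository where feature sits on top of master:

```python
# Hypothetical repository after rebasing: a <- b <- f1 <- f2.
parents = {"a": [], "b": ["a"], "f1": ["b"], "f2": ["f1"]}
branches = {"master": "b", "feature": "f2"}


def is_ancestor(older, newer):
    """True if `older` is `newer` itself or can be reached from it via parent links."""
    stack, seen = [newer], set()
    while stack:
        current = stack.pop()
        if current == older:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def fast_forward(target, source):
    """Move the `target` label onto `source` if no merge commit is needed."""
    if not is_ancestor(branches[target], branches[source]):
        return False  # histories have diverged: a real merge is required
    branches[target] = branches[source]
    return True


moved = fast_forward("master", "feature")
```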

Another manipulation that we can do on the graph is squashing, i.e. lumping several commits together into a single one. (There is no dedicated squash command: use an interactive rebase, git rebase -i, which is destructive, or git merge --squash.)

Here, the three commits of the feature branch have been condensed into a single one. No conflict can happen, but we lose the history of the changes. Squashing may be useful to clean up a complex history.
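Squashing can be sketched as collecting the changes of a chain of commits into one new commit. The repository and the change descriptions below are hypothetical, and only single-parent history is handled.

```python
# Hypothetical repository: commit name -> (parent name, list of changes).
commits = {
    "a": (None, ["init"]),
    "f1": ("a", ["add parser"]),
    "f2": ("f1", ["fix typo"]),
    "f3": ("f2", ["add tests"]),
}
branches = {"feature": "f3"}


def squash(branch, base, new_name):
    """Replace the commits between `base` and the branch tip by a single commit."""
    changes = []
    commit = branches[branch]
    while commit != base:
        parent, commit_changes = commits[commit]
        changes = commit_changes + changes  # keep the oldest changes first
        commit = parent
    commits[new_name] = (base, changes)
    branches[branch] = new_name


squash("feature", "a", "s")
```

The resulting commit contains all the changes, but the boundaries between the three original commits are gone.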

Squashing and rebasing, taken together, are extremely powerful tools for rewriting the history of a repo. With them, you can reorder commits, squash them together, move them elsewhere, and so on. However, these commands are also dangerous: since you overwrite the history, there is a lot of potential for conflicts and general mistakes. By contrast, merges are safe: even if there are conflicts and you have messed them up, you can always remove the merge commit and go back to the previous state. But when you rebase a set of commits and mess up the conflict resolution, going back is much harder: the original commits are no longer part of the graph you see, and recovering them means digging them out of the reflog (git reflog) before Git garbage-collects them.

Remotes: sharing your work with others

You can use Git as a simple version tracking system for your own projects, on your own computer. But most of the time, Git is used to collaborate with other people. For this reason, Git has an elaborate system for sharing changes with others. The good news is: everything is still represented in the graph! There is nothing fundamentally different to understand.

When two different people work on the same project, each will have a version of the repository locally. Let’s say that Alice and Bob are both working on our project.

Alice has made a significant improvement to the project, and has created several commits, tracked in the feature branch she has created locally. The graph above (after rebasing) represents Alice’s repository. Bob, meanwhile, has the same repository but without the feature branch. How can they share their work? Alice can send Bob the commits between feature and the common ancestor of master and feature. Bob will see this branch as part of a remote graph, superimposed on his own graph. (You can add, remove, rename, and generally manage remotes with git remote. To transfer data between you and a remote, use git fetch, git push, and git pull, which fetches and then merges into your local branch automatically.)

The branch name he just got from Alice is prefixed by the name of the remote, in this case alice. These are just ordinary commits, and an ordinary branch (i.e. just a label on a specific commit).

Now Bob can see Alice’s work, and has an idea to improve on it. So he wants to make a new commit on top of Alice’s changes. But the alice/feature branch is here to track the state of Alice’s repository, so he creates a new branch of his own named feature, where he adds a commit:

Similarly, Alice can now retrieve Bob’s work, and will have a new branch bob/feature with the additional commit. If she wants, she can now incorporate the new commit into her own branch feature, making her branches feature and bob/feature identical:

As you can see, sharing work in Git is just a matter of having additional branches that represent the graph of other people. Some branches are shared among different people, and in this case you will have several branches, each prefixed with the name of the remote. Everything is still represented simply in a single graph.
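The exchange between Alice and Bob can be sketched with two small repositories. Everything here is a hypothetical model: the dictionaries, the commit names, and the fetch function, which copies all commits where a real transfer would only send the missing ones.

```python
def fetch(local, remote, remote_name):
    """Copy the remote's commits and record its branches under a prefix."""
    local["commits"].update(remote["commits"])
    for branch, commit in remote["branches"].items():
        local["remote_branches"][f"{remote_name}/{branch}"] = commit


alice = {
    "commits": {"a": [], "b": ["a"], "f1": ["b"]},  # commit name -> parent names
    "branches": {"master": "b", "feature": "f1"},
    "remote_branches": {},
}
bob = {
    "commits": {"a": [], "b": ["a"]},
    "branches": {"master": "b"},
    "remote_branches": {},
}

# Bob retrieves Alice's work, then starts his own branch from hers.
fetch(bob, alice, "alice")
bob["branches"]["feature"] = bob["remote_branches"]["alice/feature"]
bob["commits"]["f2"] = ["f1"]      # Bob's new commit on top of Alice's
bob["branches"]["feature"] = "f2"

# Alice retrieves Bob's work and moves her own branch onto his commit.
fetch(alice, bob, "bob")
alice["branches"]["feature"] = alice["remote_branches"]["bob/feature"]
```

At the end, both repositories contain the same commits, and the prefixed branches record what each person last saw of the other's graph.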

Additional concepts

Unfortunately, some things are not captured in the graph directly. Most notably, the staging area used for selecting changes for committing, stashing, and submodules greatly extend the capabilities of Git beyond simple graph manipulations. You can read about all of these in Pro Git.

Internals

Note: This section is not needed to use Git every day, or even to understand the concepts behind it. However, it can quickly show you that the explanations above are not pure abstractions, and are actually represented directly this way.

Let’s dive a little bit into Git’s internal representations to better understand the concepts. The entire Git repository is contained in a .git folder.

Inside the .git folder, you will find a simple text file called HEAD, which contains a reference to a location in the graph. For instance, it could contain ref: refs/heads/master. As you can see, HEAD really is just a pointer, to somewhere called refs/heads/master. Let’s look into the refs directory to investigate:

$ cat refs/heads/master
f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd

This is just a pointer to a specific commit! You can also see that all the other branches are represented the same way. (You must have noticed that our graphs above were slightly misleading: HEAD does not point directly to a commit, but to a branch, which itself points to a commit. If you make HEAD point to a commit directly, this is called a “detached HEAD” state.)
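Following these pointers by hand is easy to automate. The sketch below builds a fake, minimal .git directory in a temporary folder (an assumption made so that the example is self-contained) and resolves HEAD to a commit hash. It ignores details of real repositories, such as references stored in a packed-refs file.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Follow HEAD to a commit hash, handling the detached case."""
    content = (Path(git_dir) / "HEAD").read_text().strip()
    if content.startswith("ref: "):
        ref = content[len("ref: "):]
        return (Path(git_dir) / ref).read_text().strip()
    return content  # detached HEAD: the file contains the hash itself


# Build a fake, minimal .git directory for demonstration purposes.
demo = Path(tempfile.mkdtemp()) / ".git"
(demo / "refs" / "heads").mkdir(parents=True)
(demo / "HEAD").write_text("ref: refs/heads/master\n")
(demo / "refs" / "heads" / "master").write_text(
    "f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd\n"
)

current = resolve_head(demo)
```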

Remotes and tags are similar: they are in refs/remotes and refs/tags.

Commits are stored in the objects directory, in subfolders named after the first two characters of their hashes. So the commit above is located at objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd. These files are compressed, and Git may also group many objects together into packfiles (for efficiency reasons), so you cannot read them directly. But if you inspect the commit (with git show), you will see its entire contents (parents, message, diff).
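As an illustration, here is a sketch of how a hash maps to a path, and of how an individually stored ("loose") object can be decoded: it is zlib-compressed data made of a short header, a null byte, and the body. The example object is synthetic, and objects stored in packfiles are not handled.

```python
import zlib


def loose_object_path(object_hash):
    """Path of an object inside .git: two-character folder, then the rest."""
    return f"objects/{object_hash[:2]}/{object_hash[2:]}"


def parse_loose_object(raw):
    """Decompress a loose object and split its header from its body."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\x00")
    kind, size = header.decode().split(" ")
    return kind, int(size), body


path = loose_object_path("f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd")

# Synthetic object for demonstration: header "commit 9", then a 9-byte body.
raw = zlib.compress(b"commit 9\x00demo body")
kind, size, body = parse_loose_object(raw)
```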

References

Brown, Amy, and Greg Wilson. 2012. The Architecture of Open Source Applications, Volume II. Creative Commons. https://www.aosabook.org/en/index.html.
