Git Basics – How Git Saves Your Work

When first learning Git, most folks take a cookbook approach. You "git add" and "git commit" as you move along with your project. Git is a version control system, so you assume you're keeping track of changes in your project, but does Git really save the contents of your file at a particular point in time, and can you get it back? You're not completely confident.

I thought it would be helpful to take another look at Git's inner workings from a fresh perspective to give you a little more confidence in using Git.

Here goes. Git is a system that saves the contents of the files in your project. It can reproduce the contents of your files exactly as you originally saved them. If you had a semicolon on line 10 of your file six versions ago, when you retrieve the file six versions back, you'll find a semicolon on line 10 in the recovered file, exactly as you saved it.

How does it save the file's content exactly as you originally saved it? We can thank the National Security Agency. They designed a one-way hash algorithm called "SHA-1" that turns any document's contents into a 40 hex character fingerprint. Git compresses the contents of your file, stores the result as a blob, and uses the SHA-1 hash as the blob's name.

It's one-way, because you can't take the 40 character hex hash and work backward to the contents of the file. Git doesn't need to. It keeps the compressed contents in the blob and uses the hash as the key to look them up.

It is unique for all practical purposes, because if you remove that semicolon on line 10, you'll have a completely different SHA-1 value for the contents. Two different files could in theory produce the same hash, but it's extremely unlikely you'd run into that by accident. The result is we have one hex name that represents the entire contents of a file, which is much easier to manage and track.
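To make the naming concrete, here is a minimal Python sketch of how Git computes the name of a blob. Git hashes a short header (the word "blob", the content length, and a null byte) followed by the file contents. The function name git_blob_hash and the sample file contents are made up for illustration; they are not part of Git.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the 40 hex character name Git would give a blob with this content."""
    # Git prepends a header: the object type, a space, the size in bytes, a null byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical file contents, with and without the semicolon.
with_semicolon = git_blob_hash(b"int x = 1;\n")
without_semicolon = git_blob_hash(b"int x = 1\n")

print(with_semicolon)
print(without_semicolon)
print(with_semicolon != without_semicolon)  # one character changes the whole name
```

You can check the result against Git itself: "git hash-object somefile" should print the same value for the same bytes.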

This allows Git to quickly determine whether a file has changed just by comparing names. It also means, since SHA-1 is calculated the same way on every computer, for every user, you can quickly and easily copy, or clone, an entire project to another computer, and it will be identical to the original content.

In Git, these compressed blobs are stored under their hash names in your .git/objects database, which is created when you run "git init" to start your repository.
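As a rough sketch of what that object store looks like on disk, the Python below writes a "loose" object the way Git does: the header and contents are zlib-compressed and saved in a file whose directory is the first two hex characters of the hash and whose file name is the remaining 38. The helper names and the temporary directory are assumptions for the demo, and real repositories also pack objects into larger files, which this ignores.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def write_loose_object(objects_dir: Path, content: bytes) -> str:
    """Store content as a blob under objects_dir and return its hash name."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    name = hashlib.sha1(store).hexdigest()
    # Layout: objects/<first 2 hex chars>/<remaining 38 hex chars>
    path = objects_dir / name[:2] / name[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return name

def read_loose_object(objects_dir: Path, name: str) -> bytes:
    """Look a blob up by name and return the original file contents."""
    raw = zlib.decompress((objects_dir / name[:2] / name[2:]).read_bytes())
    _header, _, content = raw.partition(b"\0")  # drop the "blob <size>" header
    return content

# Demo in a throwaway directory rather than a real .git/objects.
demo_dir = Path(tempfile.mkdtemp()) / "objects"
demo_name = write_loose_object(demo_dir, b"hello world\n")
print(demo_name)
print(read_loose_object(demo_dir, demo_name))
```

Notice that getting the file back is a lookup by name followed by decompression. Nothing is ever recovered from the hash itself.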

Let's go over some Git terms and we'll put it all together.

Your "working directory" is where you work, where you edit files, rename your files, make new directories, and develop your project as you work day by day. If you open up a file in an editor and make a change, you've made a change to your working directory.

Your "index" is just one file in your .git directory. It can grow quite large depending on the number of files in your project. It is a listing of every file Git is tracking for the next commit: the files you have already committed, and the files you have staged, or are ready to commit. Each file takes one entry in the index that contains the file's SHA-1 hash, its directory path and file name, and its permissions.

When you do a "git add" command, Git hashes the files in your working directory that have changed contents, adds the compressed contents to the .git/objects store under the new SHA-1 hash, and records that file as an entry in the index file. That file is now in the index, or staged, and ready to be committed. You'll notice I said the files that have changed. Git doesn't need to store files that have not changed a second time. It already has a blob with that SHA-1 hash.
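The staging step can be sketched in a few lines of Python. This is a simplified model, not Git's real code: the index is a plain dictionary from path to (mode, hash), the object store is a dictionary from hash to compressed bytes, and the function name stage is hypothetical. Real Git also caches file timestamps and sizes in the index so it can skip hashing files that have not been touched.

```python
import hashlib
import zlib

def blob_name(content: bytes) -> str:
    """SHA-1 name of a blob, computed over Git's "blob <size>" header plus content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def stage(index: dict, objects: dict, path: str, content: bytes,
          mode: str = "100644") -> bool:
    """Stage one file. Returns True if the index entry was added or updated."""
    name = blob_name(content)
    if index.get(path) == (mode, name):
        return False  # same hash as before, so nothing to do
    if name not in objects:
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        objects[name] = zlib.compress(header + content)  # store the new blob
    index[path] = (mode, name)  # record path -> hash, ready for the next commit
    return True

index, objects = {}, {}
first = stage(index, objects, "src/main.c", b"int x = 1;\n")
again = stage(index, objects, "src/main.c", b"int x = 1;\n")    # unchanged
changed = stage(index, objects, "src/main.c", b"int x = 2;\n")  # edited

print(first, again, changed)
print(len(objects))  # both versions of the file are kept as separate blobs
```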

Git really has four kinds of objects it tracks that form a hierarchy: a blob, a tree, a commit, and a tag. At the bottom is the blob, which is the compressed contents of your file, named by its SHA-1 hash. The tag is just a way to give a particular commit a special name, and points to the commit object, which leaves us the tree and commit objects to consider.

The tree object is how Git keeps track of file names and directories. There is a tree object for each directory. The tree object points to the SHA-1 blobs (the files in that directory) and to other trees (its sub-directories) at the time of the commit. Each tree object is named by, you guessed it, a SHA-1 hash of its contents, and stored in .git/objects. Since the names of the trees are SHA-1 hashes, Git can quickly see if there have been any changes to any files or directories by comparing the name to the previous name. Pretty slick.
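Here is a small Python sketch of how a tree gets its name, and why a change deep in a sub-directory shows up in the name of the top tree. Each tree entry is a mode, a file or directory name, and the raw 20-byte hash of the blob or tree it points to. The helper names and the tiny two-file project are made up for illustration, and the sort is simplified (real Git sorts directories as if their names ended in "/").

```python
import hashlib

def hash_object(kind: bytes, body: bytes) -> str:
    """Name an object the way Git does: SHA-1 over a small header plus the body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    """Build a tree's body from (mode, name, hex_sha) tuples."""
    body = b""
    # Simplified ordering: plain sort by name.
    for mode, name, hex_sha in sorted(entries, key=lambda entry: entry[1]):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_sha)  # 20 raw bytes, not 40 hex characters
    return body

def top_tree_name(main_c: bytes) -> str:
    """Name of the top tree for a tiny project: README plus src/main.c."""
    src_tree = hash_object(b"tree", tree_body([
        ("100644", "main.c", hash_object(b"blob", main_c)),
    ]))
    return hash_object(b"tree", tree_body([
        ("100644", "README", hash_object(b"blob", b"demo project\n")),
        ("40000", "src", src_tree),  # 40000 marks a sub-directory
    ]))

before = top_tree_name(b"int x = 1;\n")
after = top_tree_name(b"int x = 1\n")  # semicolon removed in src/main.c

print(before)
print(after)
print(before != after)  # the change bubbles up to the top tree's name
```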

"Git commit" only works with the index, not your working directory. One of the reasons is the SHA-1 hash of the file is created during "Git add". During the commit, the tree objects are built up, and we end up with one name for an entire tree of blobs and trees, the top tree object, which we'll use with our commit object.

The commit object keeps track of the state of the tree at a given point in time, when you commit. It consists of the SHA-1 of the current top tree with all the changes for that commit included, the name of the previous commit (or commits, for a merge), so you can track your history of commits, the author who originally wrote the change, the committer who actually made the commit, and a message describing the commit. The commit's own name is the SHA-1 hash of all of that. Remember, the person who commits a change is not always the person who wrote it, for example when someone applies a patch you sent them. The commit object keeps track of both.
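A commit object is plain text, so it is easy to sketch in Python. The layout below (tree, parent, author, committer, a blank line, then the message) follows Git's format, but the names, email addresses, and timestamps are invented for the example, and the tree used is simply the hash of an empty tree.

```python
import hashlib

def commit_body(tree: str, parents: list, author: str, committer: str,
                message: str) -> bytes:
    """Build the text of a commit object."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]  # none for a first commit
    lines.append("author " + author)
    lines.append("committer " + committer)
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")

def commit_name(body: bytes) -> str:
    """A commit's name is the SHA-1 of a "commit <size>" header plus its text."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# Hypothetical people and times: "Name <email> <unix time> <timezone>".
AUTHOR = "Ann Author <ann@example.com> 1700000000 +0000"
COMMITTER = "Carl Committer <carl@example.com> 1700000100 +0000"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

first_body = commit_body(EMPTY_TREE, [], AUTHOR, COMMITTER, "Initial commit")
first = commit_name(first_body)

second_body = commit_body(EMPTY_TREE, [first], AUTHOR, COMMITTER, "Second commit")
second = commit_name(second_body)

print(second_body.decode("utf-8"))
print(second)
```

Because the parent's name is part of the text that gets hashed, each commit's name depends on the whole history behind it.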

We have a hierarchy. The Git commit object points to the previous commit, and the top Git tree. The top Git tree points to all the tree objects under it and all files. The SHA-1 file names point to the contents. We have a history. If you mess up the code really badly in your working directory, you can easily go back and get the contents of the file as of your last commit with a simple git checkout HEAD -- "the file". Nice, and that's the reason to use a version control system like Git.

Committing in Git is about working with the index to make a snapshot of your project at the time of the commit. Since the Git index has the SHA-1 of each file along with its path and file name, Git uses it to compare the previous commit with the files ready to commit. It saves anything that changed, and refers to everything that did not change by its existing SHA-1 name.
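The comparison described here comes down to comparing names. As a rough illustration only, the Python below models the previous commit and the staged index as dictionaries from path to blob hash and sorts the paths into added, modified, deleted, and unchanged. The function name compare_snapshots and the file names are hypothetical.

```python
import hashlib

def blob_name(content: bytes) -> str:
    """SHA-1 name of a blob, computed over Git's "blob <size>" header plus content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def compare_snapshots(previous: dict, staged: dict) -> dict:
    """Classify paths by comparing hash names only, never file contents."""
    shared = [path for path in staged if path in previous]
    return {
        "added": sorted(path for path in staged if path not in previous),
        "deleted": sorted(path for path in previous if path not in staged),
        "modified": sorted(p for p in shared if staged[p] != previous[p]),
        "unchanged": sorted(p for p in shared if staged[p] == previous[p]),
    }

previous_commit = {
    "README": blob_name(b"demo project\n"),
    "src/main.c": blob_name(b"int x = 1;\n"),
    "src/old.c": blob_name(b"/* old */\n"),
}
staged_index = {
    "README": blob_name(b"demo project\n"),    # untouched
    "src/main.c": blob_name(b"int x = 2;\n"),  # edited
    "src/new.c": blob_name(b"/* new */\n"),    # new file
}

result = compare_snapshots(previous_commit, staged_index)
print(result)
```

The unchanged README needs no new blob at all. The new snapshot simply refers to the same hash name as the old one.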

That's about it for the basics. Hopefully, this will help you have a little better understanding when you read about other Git commands, and what they do, and a little more confidence that your current file is being saved when you do a "Git commit".


Git Basics – How Git Saves Your Work — 2 Comments

  1.  I stand corrected.  Git uses SHA1 as a label, or file name, for a compressed blob, which is your file contents.  Thanks for pointing this out.

  2. SHA-1 is not 2-way.  Think about it.  I can come up with an INFINITELY large set of strings and run them all through the SHA-1 algorithm to generate FIXED SIZE AND THEREFORE FINITE results.  There are many possible documents that result in the same hash, it’s just extremely unlikely that you’d find an example of one in real life.  Going two-ways here would be magic.