Developer Nation Community

Blog

Back to blog

July 13, 2022

Git Internals Part 2: How does Git store your data?

byPragati VermainTips

In this article, we’ll be learning about the basics of the data storage mechanism for git.

The most fundamental term we know regarding git and data storage is repositories. Let’s first understand what a git repository is and where it stands in terms of data storage in git.

Are you ready to influence the tech landscape? Take part in the Developer Nation Survey and be a catalyst for change. Your thoughts matter, and you could be the lucky recipient of our weekly swag and prizes! Start Here

Repositories

A git repository can be seen as a database containing all the information needed to retain and manage the revisions and history of a project. In git, repositories are used to retain a complete copy of the entire project throughout its lifetime.

Git maintains a set of configuration values within each repository such as the repository user’s name and email address. Unlike the file data or other repository metadata, configuration settings are not propagated from one repository to another during a clone, or fork, or any other duplication operation. Instead of this, git manages and stores configuration settings on a per-site, per-user, and per-repository basis.

Inside a git repository, there are two data structures – the object store and the index. All of this repository data is stored at the root of your working directory inside a hidden folder named .git. You can read more about what’s inside your .git folder here.

As part of the system that allows a fully distributed VCS, the object store is intended to be effectively replicated during a cloning process. The index is temporary data that is private to a repository and may be produced or edited as needed.

Let’s discuss object storage and index in further depth in the next section.

Git Object Types

Object store lies at the heart of the git’s data storage mechanism. It contains your original data files, all the log messages, author information, and other information required to rebuild any version or branch of the project.

Git places the following 4 types of objects in its object store which form the foundation of git’s higher-level data structures:

blobs
trees
commits
tags

Let’s look a bit more about these object types:

Blobs

A blob represents each version of a file. “Blob” is an abbreviation for “binary big object,” a phrase used in computers to refer to a variable or file that may contain any data and whose underlying structure is disregarded by the application.

A blob is considered opaque it contains the data of a file but no metadata or even the file’s name.

Trees

A tree object represents a single level of directory data. It saves blob IDs, pathnames, and some metadata for all files in a directory. It may also recursively reference other (sub)tree objects, allowing it to construct a whole hierarchy of files and subdirectories.

Commits

Each change made into the repository is represented by a commit object, which contains metadata such as the author, commit date, and log message.

Each commit links to a tree object that records the state of the repository at the moment the commit was executed in a single full snapshot. The initial commit, also known as the root commit, has no parents and the following most of the commits have single parents.

A Directed Acyclic Graph is used to arrange commits. For those who missed it in Data Structures, it simply implies that commits “flow” in one way. This is usually just the trail of history for your repository, which might be very basic or rather complicated if you have branches.

Index

The index is a transient and dynamic binary file that describes the whole repository’s directory structure. More specifically, the index captures a version of the general structure of the project at some point in time. The state of the project might be represented by a commit and a tree at any point in its history, or it could be a future state toward which you are actively building.

One of the primary characteristics of Git is the ability to change the contents of the index in logical, well-defined phases. The indicator distinguishes between gradual development stages and committal of such improvements.

How does git monitor object history?

The Git object store is organized and implemented as a storage system with content addresses. Specifically, each item in the object store has a unique name that is generated by applying SHA1 to the object’s contents, returning a SHA1 hash value.

Because the whole contents of an object contribute to the hash value, and because the hash value is thought to be functionally unique to that specific content, the SHA1 hash is a suitable index or identifier for that item in the object database. Any little modification to a file causes the SHA1 hash to change, resulting in the new version of the file being indexed separately.

For monitoring history, Git keeps only the contents of the file, not the differences between separate files for each modification. The contents are then referenced by a 40-character SHA1 hash of the contents, which ensures that it is almost certainly unique.

The fact that the SHA1 hash algorithm always computes the same ID for identical material, regardless of where that content resides, is a significant feature. In other words, the same file content in multiple folders or even on separate machines produces the same SHA1 hash ID. As a result, a file’s SHA1 hash ID is a globally unique identifier.

Every object has an SHA, whether it’s a commit, tree, or blob, so get to know them. Fortunately, they are easily identified by the first seven characters, which are generally enough to identify the entire string.

One fantastic benefit of saving only the content is that if you have two or more copies of the same file in your repository, Git will only save one internally.

Conclusion

In this article, we learned about the two primary data structures used by git to enable data storage, management, and tracking history. We also discussed the 4 types of object types and the different roles played by them in git’s data storage mechanism.

This was all for this article, I hope you find it helpful. These are the fundamental components of Git as we know it today and use on a regular basis. We’ll be learning more about these Git internal concepts in the upcoming articles.
Keep reading. In case you want to connect with me, follow the links below:

LinkedIn | GitHub | Twitter | Dev

Git internals