Building Git: Part II
So, what's been going well this week?
We are now going to pick up where we left off. You can read Part I here.
In this part we are going to extend our commit command so that it also stores the tree and commit objects.
Extending the commit command
We can already store blob data in our objects directory; now we will store trees and commits too.
Storing the tree object
Tree objects store the file structure of the repository. A blob object is pretty much the file's content behind a small header, stored in compressed form; the tree object is a different story.
Here is what we will see in the tree file stored on disk:
❯ cat .git/objects/88/e38705fdbd3608cddbe904b67c731f3234c45b | ./inflate.sh
tree 74100644 hello.txt�6%
۩VFJ100644 world.txt��t+$Y$ߙ+\q%
We cannot make much sense of this yet, so let's get a hexdump of the data:
❯ cat .git/objects/88/e38705fdbd3608cddbe904b67c731f3234c45b | ./inflate.sh | hexdump -C
00000000 74 72 65 65 20 37 34 00 31 30 30 36 34 34 20 68 |tree 74.100644 h|
00000010 65 6c 6c 6f 2e 74 78 74 00 ce 01 36 25 03 0b a8 |ello.txt...6%...|
00000020 db a9 06 f7 56 96 7f 9e 9c a3 94 46 4a 31 30 30 |....V......FJ100|
00000030 36 34 34 20 77 6f 72 6c 64 2e 74 78 74 00 cc 62 |644 world.txt..b|
00000040 8c cd 10 74 2b ae a8 24 1c 59 24 df 99 2b 5c 01 |...t+..$.Y$..+\.|
00000050 9f 71 |.q|
00000052
We can now see some bits that we can understand. Let's go through them together.
First we see the word tree, telling us that this is a tree object stored on disk, followed by the length of the content that comes after it, followed by a null byte, which shows up as a dot (.) in the text column and as 00 in the hex column.
Then come the files it stores, hello.txt and world.txt, each with its file mode and file name.
00000000 74 72 65 65 20 37 34 00 31 30 30 36 34 34 20 68 |tree 74.100644 h|
00000010 65 6c 6c 6f 2e 74 78 74 |ello.txt...6%...|
00000020 31 30 30 |....V......FJ100|
00000030 36 34 34 20 77 6f 72 6c 64 2e 74 78 74 00 |644 world.txt..b|
00000040 |...t+..$.Y$..+\.|
00000050 |.q|
00000052
In the hexdump above, we can see the metadata for all the files inside the root directory: the object type, denoted by tree; the content length, 74 bytes; the null byte, shown as a dot (.); then the first file's mode, 100644, and its name, hello.txt; and so forth for all the following files.
Now let's see what all those garbage values in the decompressed output are.
00000000 |tree 74.100644 h|
00000010 ce 01 36 25 03 0b a8 |ello.txt...6%...|
00000020 db a9 06 f7 56 96 7f 9e 9c a3 94 46 4a |....V......FJ100|
00000030 cc 62 |644 world.txt..b|
00000040 8c cd 10 74 2b ae a8 24 1c 59 24 df 99 2b 5c 01 |...t+..$.Y$..+\.|
00000050 9f 71 |.q|
00000052
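In the hexdump output we can see that these garbage values are the hashes of the files, the same IDs under which each file's data is stored on disk. They are written as 20 raw bytes each rather than as 40 hex characters, which is why they look like noise in the terminal. They tell us where to look in the object database to get each file's content.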
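To make the layout concrete, here is a small sketch that rebuilds the 74-byte tree body from the hexdump and splits it back into entries. The treeEntry type and the parseTree and sampleTreeBody functions are hypothetical names for this example only; they are not part of gitgo.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// treeEntry is a hypothetical type for this sketch; it is not part of gitgo.
type treeEntry struct {
	Mode string
	Name string
	OID  string // hex form of the 20 raw hash bytes
}

// parseTree walks the body of a tree object (everything after "tree <len>\x00")
// and splits it into "<mode> <name>\x00<20 raw bytes>" records.
func parseTree(body []byte) ([]treeEntry, error) {
	var entries []treeEntry
	for len(body) > 0 {
		sp := bytes.IndexByte(body, ' ')
		if sp < 0 {
			return nil, fmt.Errorf("missing space after mode")
		}
		nul := bytes.IndexByte(body, 0)
		if nul < sp {
			return nil, fmt.Errorf("missing null byte after name")
		}
		if len(body) < nul+1+20 {
			return nil, fmt.Errorf("truncated object ID")
		}
		entries = append(entries, treeEntry{
			Mode: string(body[:sp]),
			Name: string(body[sp+1 : nul]),
			OID:  hex.EncodeToString(body[nul+1 : nul+21]),
		})
		body = body[nul+21:]
	}
	return entries, nil
}

// sampleTreeBody rebuilds the tree body shown in the hexdump above.
func sampleTreeBody() []byte {
	var buf bytes.Buffer
	for _, e := range [][2]string{
		{"hello.txt", "ce013625030ba8dba906f756967f9e9ca394464a"},
		{"world.txt", "cc628ccd10742baea8241c5924df992b5c019f71"},
	} {
		oid, _ := hex.DecodeString(e[1])
		buf.WriteString("100644 " + e[0])
		buf.WriteByte(0)
		buf.Write(oid)
	}
	return buf.Bytes()
}

func main() {
	body := sampleTreeBody()
	entries, err := parseTree(body)
	if err != nil {
		panic(err)
	}
	// The body length is the number that goes into the "tree <len>" header.
	fmt.Println("body length:", len(body))
	for _, e := range entries {
		fmt.Println(e.Mode, e.Name, e.OID)
	}
}
```

Each entry costs 6 bytes of mode, a space, the name, a null byte, and 20 hash bytes, so the two 9-character file names give 2 × 37 = 74 bytes, matching the header.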
Now let's get to building this.
// ...
func cmdCommitHandler(commit string) error {
	// Get all the files in the working directory
	allFiles, err := os.ReadDir(gitgo.ROOTPATH)
	if err != nil {
		return fmt.Errorf("Error reading Dir: %s", err)
	}
	// Remove the files or dirs that are in the ignore list
	workFiles := gitgo.RemoveIgnoreFiles(allFiles, gitgo.GITGO_IGNORE)

	var entries []gitgo.Entries
	for _, file := range workFiles {
		if file.IsDir() {
			continue
		}
		data, err := os.ReadFile(file.Name())
		if err != nil {
			return fmt.Errorf("Error reading file: %s\n%s", file.Name(), err)
		}
		blobSHA, err := gitgo.StoreBlobObject(data)
		if err != nil {
			return fmt.Errorf("Error storing blob: %s\n%s", file.Name(), err)
		}
		entries = append(entries, gitgo.Entries{
			Path: file.Name(),
			OID:  blobSHA,
		})
	}

	// After storing all the blob data
	// create the tree entry
	treeEntry := gitgo.CreateTreeEntry(entries)
	// store the tree data in the .gitgo/objects
	treeHash, err := gitgo.StoreTreeObject(treeEntry)
	if err != nil {
		return err
	}
	// ... (treeHash is used when we store the commit, later in this post)
}
After storing all the blob data, it's time to store the tree object's data, so we call all the necessary functions to do so. Well, I have restructured my program a little bit, because managing the code was becoming a pain in the ***.
So my new directory structure is this:
.
├── cmd/
│   └── gitgo/
│       ├── main.go
│       └── cmdHandler.go
├── go.mod
├── compress.go
├── database.go
├── global.go
├── hash.go
├── inflate.sh
└── utils.go
All I did was create a function for every piece of code that was being used more than once. You can see all the changes I made on my GitHub.
Now let's get back to the matter at hand.
database.go
package gitgo

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

func StoreBlobObject(blobData []byte) ([]byte, error) {
	blobPrefix := fmt.Sprintf(`blob %d`, len(blobData))
	blobSHA := getHash(blobPrefix, string(blobData))
	blob := getCompressBuf([]byte(blobPrefix), blobData)
	hexBlobSha := hex.EncodeToString(blobSHA)
	folderPath := filepath.Join(DBPATH, hexBlobSha[:2])
	permPath := filepath.Join(DBPATH, hexBlobSha[:2], hexBlobSha[2:])
	err := StoreObject(blob, blobPrefix, folderPath, permPath)
	if err != nil {
		return nil, err
	}
	return blobSHA, nil
}

func CreateTreeEntry(entries []Entries) bytes.Buffer {
	var buf bytes.Buffer
	for _, entry := range entries {
		input := fmt.Sprintf("100644 %s", entry.Path)
		buf.WriteString(input)
		buf.WriteByte(0)
		buf.Write(entry.OID)
	}
	return buf
}

func StoreTreeObject(treeEntry bytes.Buffer) (string, error) {
	treePrefix := fmt.Sprintf(`tree %d`, treeEntry.Len())
	treeSHA := getHash(treePrefix, treeEntry.String())
	hexTreeSha := hex.EncodeToString(treeSHA)
	fmt.Printf("Tree: %s", hexTreeSha)
	tree := getCompressBuf([]byte(treePrefix), treeEntry.Bytes())
	folderPath := filepath.Join(DBPATH, hexTreeSha[:2])
	permPath := filepath.Join(DBPATH, hexTreeSha[:2], hexTreeSha[2:])
	err := StoreObject(tree, treePrefix, folderPath, permPath)
	if err != nil {
		return "", err
	}
	return hexTreeSha, nil
}

func StoreObject(
	data bytes.Buffer,
	prefix, folderPath, permPath string,
) error {
	err := os.MkdirAll(folderPath, 0755)
	if err != nil {
		return err
	}
	// Create a temp file for writing
	tName := generateGitTempFileName(".temp-obj-")
	tempPath := filepath.Join(folderPath, tName)
	tf, err := os.OpenFile(
		tempPath,
		os.O_RDWR|os.O_CREATE|os.O_EXCL,
		0644,
	)
	if err != nil {
		return fmt.Errorf("Err creating temp file: %s", err)
	}
	defer tf.Close()
	// Write to temp file (err already exists, so assign with = and not :=)
	_, err = tf.Write(data.Bytes())
	if err != nil {
		return fmt.Errorf("Err writing to temp file: %s", err)
	}
	// Rename the temp file to its permanent name
	if err := os.Rename(tempPath, permPath); err != nil {
		return fmt.Errorf("Err renaming temp file: %s", err)
	}
	return nil
}
We now have several functions for what we save to disk, and they are pretty much the same: StoreBlobObject and StoreTreeObject do the same thing with a different prefix and different data (I will have to look into how to merge them into one function, but that's for later). Both use the StoreObject function to write their data to disk: it creates the object's folder, writes to a temp file, and renames that file to its final name, which is what we did in the previous part. The CreateTreeEntry function builds the tree data from the slice of file entries, storing each file's mode, name, and hash.
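As a rough idea of what merging the two functions could look like, here is a sketch where the object type is just a parameter. encodeObject is a hypothetical helper, not something that exists in gitgo, and it stops before writing anything to disk.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// encodeObject is a hypothetical helper, not part of gitgo. It prepends the
// "<kind> <len>\x00" header, hashes header+body with SHA-1, and
// zlib-compresses the same bytes.
func encodeObject(kind string, body []byte) (oid string, compressed []byte, err error) {
	header := fmt.Sprintf("%s %d\x00", kind, len(body))
	full := append([]byte(header), body...)

	sum := sha1.Sum(full)

	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err = w.Write(full); err != nil {
		return "", nil, err
	}
	if err = w.Close(); err != nil {
		return "", nil, err
	}
	return hex.EncodeToString(sum[:]), buf.Bytes(), nil
}

func main() {
	// Assumption for this demo: a file whose content is the single line "hello".
	oid, compressed, err := encodeObject("blob", []byte("hello\n"))
	if err != nil {
		panic(err)
	}
	fmt.Println(oid, len(compressed), "compressed bytes")
}
```

With a helper like this, the blob, tree, and commit paths would differ only in the kind string and in how the body is built, and StoreObject could take the returned ID and bytes as they are.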
compress.go
package gitgo

import (
	"bytes"
	"compress/zlib"
	"slices"
)

func getCompressBuf(prefix, data []byte) bytes.Buffer {
	var buf bytes.Buffer
	prefix = append(prefix, byte(0))
	w := zlib.NewWriter(&buf)
	w.Write(slices.Concat(prefix, data))
	w.Close()
	return buf
}
Here we append the null byte to the prefix and compress the prefix and data together using zlib.
hash.go
package gitgo

import (
	"crypto/sha1"
	"io"
)

func getHash(prefix, data string) []byte {
	h := sha1.New()
	p := append([]byte(prefix), byte(0))
	io.WriteString(h, string(p))
	io.WriteString(h, data)
	shaCode := h.Sum(nil)
	return shaCode
}
Here we compute the SHA-1 hash of the prefix, the null byte, and the data.
utils.go
package gitgo

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
	"strconv"
)

func RemoveIgnoreFiles(input []os.DirEntry, ignore []string) []os.DirEntry {
	ignoreMap := make(map[string]bool)
	for _, v := range ignore {
		ignoreMap[v] = true
	}
	var result []os.DirEntry
	for _, v := range input {
		if !ignoreMap[v.Name()] {
			result = append(result, v)
		}
	}
	return result
}

func generateGitTempFileName(prefix string) string {
	randomInt := (rand.Intn(999999) + 1)
	return prefix + strconv.Itoa(randomInt)
}
In the utility file, we remove the files that are meant to be ignored (currently defined in a global variable with only 3 files or dirs in it), and we generate the temp file name.
Storing the commit object
Storing the commit object is very easy. Printing a commit object's data gives output like this:
❯ cat .git/objects/aa/14833e0f4f21ecf6b7e79d2d305b151c1d728f | ./inflate.sh
commit 171tree 88e38705fdbd3608cddbe904b67c731f3234c45b
author Vikuuu <adivik672@gmail.com> 1739463318 +0530
committer Vikuuu <adivik672@gmail.com> 1739463318 +0530

Initial commit
First comes the prefix, in this case commit, then the length of the content, followed by the null byte (it is invisible here, but you can check it in a hexdump).
Commits are stored as a series of headers followed by the commit message:
- tree: Every commit refers to a single tree that represents the state of your code at that commit. Instead of storing diffs, a commit stores a pointer to a snapshot of all the files and dirs at that point; compression keeps this space-efficient.
- author: This field is metadata. It contains the name, email, and Unix timestamp for when the change was authored.
- committer: This is also metadata, often the same as the author. It may differ when somebody writes some changes and then someone else amends the commit or cherry-picks it onto another branch. Its time reflects when the commit was actually written, while the author field retains the time the content was authored. These distinctions originate from the workflow used by the Linux kernel, which git was originally developed to support.
The commit message is whatever the committer wrote for the commit, describing what was done in it.
The user can pass the --message or -m flag and write their commit message inside quotes, but this approach can be quite limiting for multi-line or long commit messages.
The other option is to call the command without any flags. git will then open the file .git/COMMIT_EDITMSG in your configured text editor (for example nano or vim). The user writes their message in that file, saves it, and closes it. git then reads the commit message from this file.
For now, we will read the commit message from stdin (standard input), either by echoing the message or by using cat on a file that holds the commit message, and then piping it to the commit command of our gitgo.
What is piping? When we used | in our terminal to decompress the git data, we were using the pipe operator. With the pipe (|) operator, the shell takes the output of the first command and gives it to the second command as its input, and so on.
pacman -Q | grep discord
In the above command we list all the packages installed on the system (I use Arch btw…), then feed that output to grep as its input to search for the package named discord.
Now, for the author and committer details: git reads them from its configuration, typically the global .gitconfig file, which defines the user's name and email. For now we will get the user's name and email from Unix environment variables.
Whenever you start your terminal (or your system, for that matter), your operating system defines some environment variables. Run this command in your terminal:
env
You will see a bunch of key-value pairs. One that might feel familiar is PWD, which holds the present working directory.
So we will first export new environment variables and then use them in our code to set the author and committer name. We can do so with these commands:
export GITGO_AUTHOR_NAME=test
export GITGO_AUTHOR_EMAIL=test@example.com
If you restart your terminal or system, these environment variables will be lost, so you will have to re-export them. If that is not a problem, then that's good (that's what I did). If you want them to persist, add these lines to the shell script that runs when you log in to your computer or start the terminal. If you are using bash, there is a file named .bashrc in your home dir: add these lines at the end of it, save it, and restart the terminal. For most shells the startup file lives in the home dir and has a name ending in rc.
Let's extend the commit command so that it can store the commit object data.
cmd/gitgo/cmdHandler.go
package main

// ...
func cmdCommitHandler(commit string) error {
	// ...
	// after saving the tree object

	// Here we are getting the environment variables
	// that we exported earlier
	name := os.Getenv("GITGO_AUTHOR_NAME")
	email := os.Getenv("GITGO_AUTHOR_EMAIL")
	// Here we are generating the metadata string in the
	// form git stores it: the author and committer name,
	// email, and timestamp
	author := gitgo.Author{
		Name:      name,
		Email:     email,
		Timestamp: time.Now(),
	}.New()
	// Currently we are reading from the stdin
	message := gitgo.ReadStdinMsg()
	// creating the data to store on the disk
	commitData := gitgo.Commit{
		TreeOID: treeHash,
		Author:  author,
		Message: message,
	}.New()
	// Storing the commit object on the disk
	cHash, err := gitgo.StoreCommitObject(commitData)
	if err != nil {
		return err
	}
	// ... (cHash is written to the HEAD file further down)
}
Here we are extending our cmdCommitHandler by storing the commit object after storing the tree object.
database.go
package gitgo

// ... (this part also needs the bufio, strings, and time imports)

type Author struct {
	Name      string
	Email     string
	Timestamp time.Time
}

type Commit struct {
	TreeOID string
	Author  string
	Message string
}

func (a Author) New() string {
	unixTimeStamp := a.Timestamp.Unix()
	// getUTCOffset is a helper that is not shown in this post;
	// it returns the offset in a form like +0530
	utcOffset := getUTCOffset(a.Timestamp)
	return fmt.Sprintf("%s <%s> %d %s", a.Name, a.Email, unixTimeStamp, utcOffset)
}

func (c Commit) New() string {
	lines := []string{
		fmt.Sprintf("tree %s", c.TreeOID),
		fmt.Sprintf("author %s", c.Author),
		fmt.Sprintf("committer %s", c.Author),
		"",
		c.Message,
	}
	return strings.Join(lines, "\n")
}

func ReadStdinMsg() string {
	reader := bufio.NewReader(os.Stdin)
	// Reads up to and including the first newline; the error is ignored for now
	msg, _ := reader.ReadString('\n')
	return msg
}

func StoreCommitObject(commitData string) (string, error) {
	commitPrefix := fmt.Sprintf(`commit %d`, len(commitData))
	commitHash := getHash(commitPrefix, commitData)
	commit := getCompressBuf([]byte(commitPrefix), []byte(commitData))
	hexCommitHash := hex.EncodeToString(commitHash)
	folderPath := filepath.Join(DBPATH, hexCommitHash[:2])
	permPath := filepath.Join(DBPATH, hexCommitHash[:2], hexCommitHash[2:])
	err := StoreObject(commit, commitPrefix, folderPath, permPath)
	if err != nil {
		return "", err
	}
	return hexCommitHash, nil
}
Here we have created some structs to define the author and the commit, along with the methods that return the metadata stored in the commit, which we used in the cmdCommitHandler function. There is also a function to read the input given on stdin; it reads only until the first newline, which is currently a limitation.
Then comes the usual storing of the object to disk.
Now we will create a new file named HEAD inside our .gitgo directory; for now it stores the commit's ID.
cmd/gitgo/cmdHandler.go
func cmdCommitHandler(commit string) error {
	// ...
	// O_TRUNC clears any previous content before the new ID is written
	headFile, err := os.OpenFile(
		filepath.Join(gitgo.GITPATH, "HEAD"),
		os.O_WRONLY|os.O_CREATE|os.O_TRUNC,
		0644,
	)
	if err != nil {
		return fmt.Errorf("Err creating HEAD file: %s", err)
	}
	defer headFile.Close()
	// err already exists here, so assign with = and not :=
	_, err = headFile.WriteString(cHash)
	if err != nil {
		return fmt.Errorf("Err writing to HEAD file: %s", err)
	}
	fmt.Printf("root-commit %s\n", cHash)
	return nil
}
Now we are done with the updates to our commit command for this blog. Let's check the things we have created.
❯ gitgo init
Initialized empty Gitgo repository in /home/viku/Workspace/personal/go/tests/gitgo-test/.gitgo
❯ echo "This is Initial Commit with gitgo" | gitgo commit
Let’s see the tree structure.
❯ tree .gitgo
.gitgo
├── HEAD
├── objects
│   ├── 33
│   │   └── 1a074da94977ec6f12bf1f4d22a670dd1a84bd
│   ├── b0
│   │   └── 05ad2e88861404f2536b36bd0ef31d51767285
│   ├── cc
│   │   └── 628ccd10742baea8241c5924df992b5c019f71
│   ├── ce
│   │   └── 013625030ba8dba906f756967f9e9ca394464a
│   └── f4
│       └── 805b5f927de718c2bf531ee024c8a81ccc9f86
└── refs
8 directories, 6 files
Let's get the tree file's data:
❯ cat .gitgo/objects/f4/805b5f927de718c2bf531ee024c8a81ccc9f86 | inflate.sh | hexdump -C
00000000 74 72 65 65 20 31 31 32 00 31 30 30 36 34 34 20 |tree 112.100644 |
00000010 68 65 6c 6c 6f 2e 74 78 74 00 ce 01 36 25 03 0b |hello.txt...6%..|
00000020 a8 db a9 06 f7 56 96 7f 9e 9c a3 94 46 4a 31 30 |.....V......FJ10|
00000030 30 36 34 34 20 69 6e 66 6c 61 74 65 2e 73 68 00 |0644 inflate.sh.|
00000040 b0 05 ad 2e 88 86 14 04 f2 53 6b 36 bd 0e f3 1d |.........Sk6....|
00000050 51 76 72 85 31 30 30 36 34 34 20 77 6f 72 6c 64 |Qvr.100644 world|
00000060 2e 74 78 74 00 cc 62 8c cd 10 74 2b ae a8 24 1c |.txt..b...t+..$.|
00000070 59 24 df 99 2b 5c 01 9f 71 |Y$..+\..q|
00000079
This is the output we wanted. Now let's get the commit object's data.
❯ cat .gitgo/objects/33/1a074da94977ec6f12bf1f4d22a670dd1a84bd | ./inflate.sh
commit 179tree f4805b5f927de718c2bf531ee024c8a81ccc9f86
author test <test@example.com> 1739721755 +0530
comitter test <test@example.com> 1739721755 +0530
This is Initial Commit with gitgo
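Cool, everything is as we expected it to be, apart from one detail: this run printed comitter, while git (and the Commit.New code above) spells the header committer, so the stored length reads 179 where the corrected spelling gives 180.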
Afterword
So, in this blog we extended our commit command to create files for storing the root tree structure and the commit object. We are nowhere near what git does, but it's a good, steady start.
In the next part we will be working on storing the history.
Code Link: Github
Just know this,
Reinvent the wheel, so that you can learn how to invent wheel
– a nobody