Building Git : Part I
Recently, I’ve been tackling with the dilema of what to create for my project as a source of learning. Between the many choices I decided to go with the Git (the one linus built in 5 days, let’s see how much it will take me?). For this I won’t be shooting the bullet in the dark, but trying to navigate my way through the book, Building Git by James Coglan. This book explains the git functionality and codes it in Ruby, but I will be building it in Go programming language, so I will have to convert it from ruby to go, but ruby being ruby seeing it just currently seems to me like pseudo-code (almost like magic), that’s good for me I will have to take the psedo-code and the concepts mentioned in the books and convert it into the go code, for now I won’t be going to optimised it or any thing like that, I will do so after completing the book and making a working clone of git, then I will go into the refactoring and making it more idomatically Go like.
Git Basics
I will be going on the basics as the feature I will be working at, this will also work as a revision for me { that’s great, I will have to re-read the book yeeeeee.. ;( }.
So what the hell is git?
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
Git is easy to learn and has a tiny footprint with lightning fast performance. It outclasses SCM tools like Subversion, CVS, Perforce, and ClearCase with features like cheap local branching, convenient staging areas, and multiple workflows.
– git
Hmmm… so git helps you to manage your codebase weather small or large. It helps you create different versions of your code so that you can understand you have done previously, and where you are going currently. It would have been great if I knew about the git in my school when I had to make the final project, it was a whole lot of pain maintaning the version of my code and know which one is the latest version. My naming was link final_project, finalest_project, new_final_project, and so on.
In this post, we will build the init
& commit
command. For this we will have to know what git does when these
commands are used by the user.
git init
User uses the init command for initializing the .git
folder in their repository for versioning that repository. If no
parameter is provided then it creates the .git folder in the directory it was called from, or if you give it the directory
path then it will create it in that folder. When we initialises the .git directory, we are greeted with this structure
.
├── .git
├── README.md
└── somefile.txt
We have .git folder in own repository, the dot(.) in the name makes it a hidden folder. If we inspect the tree structure of .git we will see
.
└── .git/
├── config
├── description
├── HEAD
├── hooks
├── info
├── objects
└── refs
We will ignore most of these(i don’t know what they do yet!), for now we will focus on .git/objects
. This is the directory that acts as the database of the git, it stores all the data of all the files, directory and there content of it, and how it does that? We will know that by doing.
Initializing the git repo
I will be calling mine project gitgo (no creativity… boring.). You can structure your project how ever you want.
// main.go
package main
import (
"flag"
"fmt"
)
func main() {
// defining the flag of our cli
init := flag.String("init", "", "Create .gitgo files in directory")
flag.Parse()
// it means that the user entered only: `gitgo` without
// any flag or parameters
if len(os.Args) < 2 {
flag.Usage()
return
}
switch os.Args[1] {
case "init":
err := cmdInitHandler(*init)
if err != nil {
fmt.Fprintf(os.Stderr, "Error: %s", err)
os.Exit(1)
}
}
}
// cmdHandler.go
package main
import (
"path/filepath"
"os"
)
func cmdInitHandler(initPath string) error {
// folders to create inside .gitgo folder
gitFolder := []string{"objects"}
// this will give us the working directory from
// which the cli tool was called
wd, err := os.Getwd()
if err != nil {
return err
}
// this means that the user has given us the dir name
// or path (now that I am seeing this I know an issue
// with it.)
if len(initPath) > 0 {
wd = filepath
}
gitPath := filepath.Join(wd, ".gitgo")
// creating all the dir defined in the gitFolder slice
for _, folder := range gitFolders {
err = os.MkdirAll(filepath.Join(gitPath, folder), 0755)
if err != nil {
return err
}
}
fmt.Fprintf(os.Stdout, "Initialized empty gitgo repository in %s", gitPath)
return nil
}
Most of the code is pretty self explainotary, and I have added the comments to explain. The problem that I mentioned in the comment is that I am not checking the user input if the directory name and path given by user is correct or not, for now we will make an assumption that it will be correct(finger crossed.)
This much is the easy part. Now is the part where you have to use your brains a lil.
Commit command
Now we will start building the git commit
command, where all the data of the repository is stored in the
.gitgo/objects
directory. For now we will only be storing the meta-data and the data of all the repo files.
So what do we need? How does git does it?
When we run the git commit
command, git tries to take a snapshot of our current folder structure and all the data inside
them. But doing so can make the repository size increase, so we takle it by compressing the data in the files using zlib
compression method.
Git stores the content of the file in the compressed form in the format, if we were to decompress it we will see this
blob %d\0{now the content of the file in compressed form}
The first world tells us the it is a file or blob, %d means it tells us the length of the content of the file followed by the null byte indicating the end of the metadata and now the actual content is displayed.
Lets say we have a file hello.txt, and it contains only a word “hello” in it. When we see the content of the gits object file for out hello.txt we will see something like this
$ cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
xKOR0c
This is some gibberish that we see (what is that command, we will see), lets create a script to decompress the data
#!/bin/bash
python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
Now run this with the getting the file content.
$ cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
blob 6hello
Now we can make some sense of it, we see blob telling us its type, a number 6 telling the length of the file content, and then the file’s actual content. But where is the null byte that we cannot see in the ASCII representation, but belive me its there.
Now what is this $ cat .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
, more specifically this ce/013…., well
this is the SHA1 hash of the file we created when we ran the commit
command. The hash is calculated from the content of
the file.
What is hash?
Hash or more specifically hash function, is a function that takes an input and gives and output, but in our case of SHA1 the output will always be a 20 byte hexadecimal string. So you can give it any variable length string and it will always spit out the 20 byte hexadecimal string. One more benefit of this is for the same content it will always give the same output, meaning if the content of the two inputs are same then the output of those will be same, this can tell us if the content of the file are changed or not.
The structure is like this,
.
└── .git/
└── objects/
└── ce/
└── 013625030ba8dba906f756967f9e9ca394464a
The first two letter of the hash are used as the folder, and the remaining are used as the file’s name.
Now, let’s get to building.
// main.go
// ...
func main () {
// ...
case "commit":
err := cmdCommitHandler()
if err != nil {
fmt.Fprintf(os.Stderr, "Error: %s", err)
os.Exit(1)
}
// ...
}
In the main.go
we added a new case in our switch statement for the commit
command. Added a new function cmdCommitHandler
that will take care of all the commit
functionality.
// cmdHandler.go
GITGO_IGNORE = []string{".", "..", ".gitgo"}
func cmdCommitHandler() error {
// getting the path from where the command is called
rootPath, err := os.Getwd()
if err != nil {
return fmt.Errorf("Error getting pwd: %s", err)
}
gitPath := filepath.Join(rootPath, ".gitgo")
dbPath := filepath.Join(gitPath, "objects")
// Get all the files in the working directory
allFiles, err := os.ReadDir(rootPath)
if err != nil {
return fmt.Errorf("Error reading Dir: %s", err)
}
workFiles := removeIgnoreFiles(
allFiles,
GITGO_IGNORE,
) // Remove the files or Dir that are in ignore
for _, file := range workFiles {
// Currently ignoring the directories
if file.IsDir() {
continue
}
// Reading the files content
data, err := os.ReadFile(file.Name())
if err != nil {
return fmt.Errof("Error reading file: %s\n%s", file.Name(), err)
}
// this is the blob prefix that will be added
blobPrefix := fmt.Sprintf(`blob %d`, len(data))
// getting the SHA-1
blobSHA := GetObjectHash(blobPrefix, string(data))
blob := GetCompressBuf([]byte(blobPrefix), data, byte(0))
hexBlobSha := hex.EncodeToString(blobSHA)
err = os.MkdirAll(filepath.Join(dbPath, hexBlobSha[:2]), 0755)
if err != nil {
return fmt.Errorf("Error creating Dir: %s", err)
}
// Create a temp file for writing
tName = GenerateGitTimeFileName(".time-obj-")
tempPath := filepath.Join(dbPath, hexBlobSha[:2], tName)
tf, err := os.OpenFile(
tempPath,
os.O_RDWR|os.O_CREATE|os.O_EXCL,
0644,
)
defer tf.Close()
if err != nil {
return fmt.Errorf("Err creating tempFile: %s", err)
}
// Write to the temp file
_, err := tf.Write(blob.Bytes())
if err != nil {
return fmt.Errorf("Err writing to temp file: %s", err)
}
// Rename the file
permPath := filepath.Join(dbPath, hexBlobSha[:2], hexBlobSha[2:])
os.Rename(tempPath, permPath)
}
}
func removeIgnoreFiles(input []os.DirEntry, ignore []string) []os.DirEntry {
ignoreMap := make(map[string]bool)
for _, v := range ignore {
ignoreMap[v] = true
}
var res []os.DirEntry
for _, v := range input {
if !ignoreMap[v.Name()] {
res = append(res, v)
}
}
return res
}
func GetObjectHash(blobPrefix, data string) []byte {
h := sha1.New()
var prefix []byte
prefix = append([]byte(blobPrefix), byte(0))
io.WriteString(h, prefix)
io.WriteString(h, string(data))
shaCode := h.Sum(nil)
return shaCode
}
func GetCompressBuf(prefix, data []byte, nullByte byte) bytes.Buffer {
var buf bytes.Buffer
prefix = append(prefix, nullByte)
w := zlib.NewWriter(&buf)
w.Write(slices.Concat(prefix, data))
w.Close()
return buf
}
func GenerateGiTempFileName(prefix string) string {
randomInt := (rand.Intn(999999) + 1)
return prefix + strconv.Itoa(randomInt)
}
Phewwww… that was long(that’s what she said). Now let’s go over the code we have written.
Firstly, we get all the paths that will be required for us.
Using the path, we get all the files, dirs in that path (I think this is not secure as the user can run the command from anywhere so we will have to create a check as
this is the correct repository or path and what not, but
that’s not for now :) ). We remove all the files, dirs that
are present in the GITGO_IGNORE
slice.
Then we loop over all the files & dirs in the workFiles
slice.
For now we are skipping the directories, then we read the files data and store it, creating a blob prefix, then
getting the hash from the blob prefix, null byte and file data.
This hash will give us the dir and file name, first 2 character for folder name, and then remaining for file name.
We will create a temporary file and store the compressed data
in it, so that no one can access it while the commit
command
is working.
After writing to the temp file we rename it to the permanent file name and voila… We are done for now.
Afterword
In this blog we created two git command
gitgo init
gitgo commit
These command currently do not have all the functionality that is required for the working git
clone.
But we will get there one step at a time.
In the next post we will try to further enhance these command or create a new command.
You can find my code here: gitgo
Just know this,
Reinvent the wheel, so that you can learn how to invent wheel
– nobody