Building Git: Part IV
Yo what’s good, guys? What amazing project you guys are working on?
In the previous part we built the link to the past, aka commit history.
Currently, in our implementation we are just storing only the single
tree (or the root directory) but the git also stores all the sub-directories
in the root directory or the directory in which the .git
file has been
initialized.
Now, we are going to update our code to store all the sub-directories in the root tree.
Executable files
Currently, we are just storing the file mode as 100644
, which is the default for
all the normal files, but what about the files that are not normal? Like exectuable
complied file created by running go build
. These files have different file mode
mainly 100755
.
The starting 10
tells us that it is a file, and the remaining part tells us the
permission of the files for the different users. If you want to understand the
file permission you can read my Understanding Linux filesystem.
Storing Executable files in tree
Now let’s get to storing the file modes.
database.go
import (
"syscall"
)
// Updating the Entries struct
type Entries struct {
Path string
OID []byte
Stat string // New field for storing file mode
}
func CreateTreeEntry(entries []Entries) bytes.Buffer {
var buf bytes.Buffer
for _, entry := range entries {
// Update this line, now storing dynamic filemode
input := fmt.Sprintf("%s %s", entry.Stat, entry.Path)
buf.WriteString(input)
buf.WriteByte(0)
buf.Write(entry.OID)
}
return buf
}
// New function to get the file mode
func FileMode(file os.DirEntry) (uint32, error) {
f, err := os.Stat(file.Name())
if err != nil {
return 0, err
}
// getting the stat from the underlying syscall
// it will only work for unix like operating systems
// but we can make it work for the windows, by checking
// the underlying operating system. But that for later
// me.
stat := f.Sys().(*syscall.Stat_t)
return stat.Mode, nil
}
cmdHandler.go
import (
"strconv"
)
func cmdCommitHandler(commit string) error {
// Get all the files in the working directory
allFiles, err := os.ReadDir(gitgo.ROOTPATH)
if err != nil {
return fmt.Errorf("Error reading Dir: %s", err)
}
workFiles := gitgo.RemoveIgnoreFiles(
allFiles,
gitgo.GITGO_IGNORE,
) // Remove the files or Dir that are in ignore.
var entries []gitgo.Entries
for _, file := range workFiles {
if file.IsDir() {
continue
}
data, err := os.ReadFile(file.Name())
if err != nil {
return fmt.Errorf("Error reading file: %s\n%s", file.Name(), err)
}
fileMode, err := gitgo.FileMode(file)
if err != nil {
return err
}
blobSHA, err := gitgo.StoreBlobObject(data)
entry := gitgo.Entries{
Path: file.Name(),
OID: blobSHA,
Stat: strconv.FormatUint(uint64(fileMode), 8),
}
entries = append(entries, entry)
}
// .....
With this, now we can store the executable files in our tree structure.
Nested tree
Firstly, we will have to see how git stores the sub-direcotries.
Lets make a new directory to test this out.
mkdir test && cd test
echo "this is file1" > file1.txt
mkdir folder1
echo "this is file2" > folder1/file2.txt
mkdir folder1/folder2
echo "this is file3" > folder1/folder2/file3.txt
Now commit this to git, you will get a root hash.
❯ git cat-file -p 6eab21e
tree 314adb2b05c2d64911655eff66cf5c9d381a5a4c
author Vikuuu <adivik672@gmail.com> 1743399030 +0530
committer Vikuuu <adivik672@gmail.com> 1743399030 +0530
initial commit
Now cat the tree file.
❯ git cat-file -p 314adb2b05c2d64911655eff66cf5c9d381a5a4c
100644 blob 433eb172726bc7b6d60e8d68efb0f0ef4e67a667 file1.txt
040000 tree 7662ba3434fd7f48ad6d1df1c7501498631bfd74 folder1
Hmmm, so we can see here that we have a blob, and a tree in our root tree, which is the
same as we created our directory, that has a file and a folder. Now lets see inside the folder1
tree.
❯ git cat-file -p 7662ba3434fd7f48ad6d1df1c7501498631bfd74
100644 blob f138820097c8ef62a012205db0b1701df516f6d5 file2.txt
040000 tree 29200651ef2c4956cf3d6d04d164570c43966781 folder2
We get the similar output, that is identical to our directory. the folder1
directory has a single file and a folder.
Now lets see inside the folder2
.
❯ git cat-file -p 29200651ef2c4956cf3d6d04d164570c43966781
100644 blob a309e46e332a0f166453c6137344852fab38d120 file3.txt
Now we only get a blob object, because we only have a single file in folder2
As we saw in the above examples, the tree blob stores the another tree hash, for the sub-directories and then that tree stores the hash for the blob objects, this tree structure in called Merkle tree, in merkle tree the parent holds the hash of its children.
Building a Merkle tree
So firstly, we want to get all the files, files inside the folders, and so on. And we want our output to look something like this,
{
"file1.txt",
"file2.txt",
"folder1/file3.txt",
"folder2/folder3/file4.txt",
"folder2/folder3/file5.txt",
}
We want the flatten list of all the files, and files inside any subdirectories.
database.go
var g_ignore = map[string]bool{
".": true,
"..": true,
".gitgo": true,
}
files.go
package gitgo
import (
"fmt"
"io/fs"
"path/filepath"
)
func ListFiles(dir string) ([]string, error) {
var workfiles []string
err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
if err != nil {
return err
}
name := filepath.Base(path)
if _, found := g_ignore[name]; found {
if d.IsDir() {
return filepath.SkipDir
}
return nil
}
// Append only files, not directories
if !d.IsDir() {
relPath, _ := filepath.Rel(dir, path)
workfiles = append(workfiles, relPath)
}
return nil
})
if err != nil {
return nil, err
}
return workfiles, nil
}
Here we are using the WalkDir
function from the path/filepath
standard library, to walk all the files and folders inside
the root directory provided. and then just adding there relative name,
with respect to the root directory path provided, here we are only
adding the files, so that we get only the file paths.
Now we can change our cmdHandler
function to first store all the blob objects
to the disk using this flatten list returned to it, and then try to create a tree
structure from the given flatten list, and then store all the sub-tree object
and then lastly the root tree object onto the disk.
cmd/gitgo/cmdHandler.go
func cmdCommitHandler(_ string) error {
rootPath := gitgo.ROOTPATH
// storing all the blobs first
entries, err := gitgo.StoreOnDisk(rootPath)
if err != nil {
return err
}
// build merkle tree, and store all the
// subdirectories tree file
tree := gitgo.BuildTree(entries)
e, err := gitgo.TraverseTree(tree)
if err != nil {
return err
}
// now store the root tree
rootTree := gitgo.TreeBlob{Data: gitgo.CreateTreeEntry(e)}.Init()
treeHash, err := rootTree.Store()
if err != nil {
return err
}
// storing commit object
// ....
}
In this we laid out the structure, now build how we are goint to accomplish this.
Let move the Entries
struct to a new file
entries.go
package gitgo
type Entries struct {
Path string
OID string
Stat string
}
func NewEntry(name, oid, stat string) *Entries {
return &Entries{Path: name, OID: oid, Stat: stat}
}
database.go
Lets change the name of our previous Tree
struct
-type Tree struct {
+type TreeBlob struct {
Prefix string
Data bytes.Buffer
}
-func BlobInitialize(data []byte) *Blob {
- prefix := fmt.Sprintf(`blob %d`, len(data))
- return &Blob{
- Prefix: prefix,
- Data: data,
- }
+func (b Blob) Init() *Blob {
+ prefix := fmt.Sprintf(`blob %d`, len(b.Data))
+ b.Prefix = prefix
+ return &b
}
-func TreeInitialize(data bytes.Buffer) *Tree {
- prefix := fmt.Sprintf(`tree %d`, data.Len())
- return &Tree{
- Prefix: prefix,
- Data: data,
- }
+func (t TreeBlob) Init() *TreeBlob {
+ prefix := fmt.Sprintf(`tree %d`, t.Data.Len())
+ t.Prefix = prefix
+ return &t
}
-func (b *Blob) Store() ([]byte, error) {
+func (b *Blob) Store() (string, error) {
return StoreBlobObject(b.Data, b.Prefix)
}
-func (t *Tree) Store() (string, error) {
+func (t *TreeBlob) Store() (string, error) {
return StoreTreeObject(t.Data, t.Prefix)
}
-func StoreBlobObject(blobData []byte, prefix string) ([]byte, error) {
+func StoreBlobObject(blobData []byte, prefix string) (string, error) {
@@ -129,10 +126,10 @@ func StoreBlobObject(blobData []byte, prefix string) ([]byte, error) {
permPath := filepath.Join(DBPATH, hexBlobSha[:2], hexBlobSha[2:])
err := StoreObject(blob, prefix, folderPath, permPath)
if err != nil {
- return nil, err
+ return "", err
}
- return blobSHA, nil
+ return hexBlobSha, nil
}
func StoreCommitObject(commitData, prefix string) (string, error) {
-func FileMode(file os.DirEntry) (uint32, error) {
- f, err := os.Stat(file.Name())
+func FileMode(file string) (uint32, error) {
+ f, err := os.Stat(file)
if err != nil {
return 0, err
}
After all the patching, lets write the code for abstracting the storage of the blob object,
database.go
func StoreOnDisk(path string) ([]Entries, error) {
// returns the flatten list of files in root directory
files, err := ListFiles(path)
if err != nil {
return nil, err
}
var entries []Entries
for _, f := range files {
err = blobStore(f, &entries)
if err != nil {
return nil, err
}
}
return entries, nil
}
func blobStore(f string, entries []Entries) error {
fp := filepath.Join(ROOTPATH, f)
data, err := os.ReadFile(fp)
if err != nil {
return err
}
blob := Blob{Data: data}.Init()
fileMode, err := FileMode(fp)
if err != nil {
return err
}
hash, err := blob.Store()
entry := Entries{
Path: f,
OID: hash,
Stat: strconv.FormatUint(uint64(fileMode), 8),
}
*entries = append(*entries, entry)
return nil
}
With this now we firstly store all the blob object in our repository.
Now, let’s build the tree structure from the flatten list and then store all the tree structure.
tree.go
package gitgo
import (
"bytes"
"encoding/hex"
"errors"
"fmt"
"log"
"sort"
"strings"
)
type Node interface{}
type Tree struct {
Nodes map[string]Node
}
func NewTree() *Tree {
return &Tree{Nodes: make(map[string]Node)}
}
func BuildTree(entries []Entries) *Tree {
root := NewTree()
for _, entry := range entries {
parts := strings.Split(entry.Path, "/")
current := root
// Walk through each part of the path.
for i, part := range parts {
// Last part, this is file entry
}
if i == len(parts)-1 {
current.Nodes[part] = NewEntry(part, entry.OID, entry.Stat)
} else {
// Intermediate directory, see if a Tree already exists.
if node, ok := current.Nodes[part]; ok {
if subtree, ok := node.(*Tree); ok {
current = subtree
} else {
// conflict, an entry already exists where a directory is expected.
log.Printf("%q already exists as a file; cannot create directory\n" part)
break
}
} else {
// Create a new tree for this directory
newTree := NewTree()
current.Nodes[part] = newTree
current = newTree
}
}
}
}
func TraverseTree(tree *Tree) ([]Entries, error) {
entry := []Entries{}
for name, node := range tree.Nodes {
switch n := node.(type) {
case *Tree:
e, err := TraverseTree(n)
if err != nil {
return nil, err
}
tree := TreeBlob{Data: CreateTreeEntry(e)}.Init()
hash, err := tree.Store()
if err != nil {
return nil, err
}
ne := Entries{Path: name, OID: hash, Stat: "040000"}
entry = append(entry, ne)
case *Entries:
entry = append(entry, *n)
default:
return nil, errors.New("unknown data type")
}
}
return entry, nli
}
Lets move the CreateTreeEntry
function from the database.go
to the tree.go
tree.go
func CreateTreeEntry(entries []Entries) bytes.Buffer {
var buf bytes.Buffer
sort.Slice(entries, func(i, j int) bool {
return entries[i].Path < entries[j].Path
})
for _, entry := range entries {
input := fmt.Sprintf("%s %s", entry.Stat, entry.Path)
buf.WriteString(input)
buf.WriteByte(0)
rawOID, err := hex.DecodeString(entry.OID)
if err != nil {
log.Printf("converting oid to raw: %s, for: %s", err, entry.Path)
}
buf.Write(rawOID)
}
return buf
}
In our CreateTreeEntry
function we added the sorting function,
to first sort all the entries, and then write them to the disk,
because git
also firstly sorts the directory and then stores them
to the disk, this help in know if there is any change in the directory
structure or not.
Design choice
Why are we first creating a flattened list, storing all the blobs and then recreating the tree from that flatten list and then storing the tree? Why can’t we store the blobs and tree object on disk as we traverse the directory and do it in one go?
Well, I just blindly followed the book…jk. I was also thinking about this and trying to be a smart ass. So, I also thought about this why not do it in a single time, and use the dfs and recursion, for this. So I set out on a journey to do so, and just to find no end :(.
Well I could not implement it, it was wasting my time, and the book also said that this is a good design choice as we are not intermingling the storing of blob, and tree object.
So Here you have the answer for this question
Afterwords
So, in this blog we extended our git
implementation, we
can now store the sub-directories and not be limited to the single root directory structure.
And we also able to store the executable files in the tree object
In the next part we will be working on creating an Index.
Code Link: Github
Just know this,
Reinvent the wheel, so that you can learn how to invent wheel
– a nobody