Building Git Part 5 | Aditya blog

Building Git: Part V

Git Staging

Hola amigos!

In the previous part we updated our code to store executable files with the correct file mode. The major change was storing the proper directory structure: we store every sub-directory under the root directory by first storing all the blob files and then building the Merkle tree.

In this part we are going to update how we commit, which I would say is the main thing. Currently, our code adds every file in the directory to the commit, and we have no control over which files are included. So we are going to add a new command, add, so that we can choose which files go into the commit.

Adding `add` command

Well, how does git do this? Git maintains a file called index in the .git/ directory. In this file git records the files that you have added using the git add command. Then, when you run git commit, it commits the files recorded in this index file.

Let’s see how git does it with an example. We have a directory that contains two files.

❯ lsd
 file1.txt   file2.txt

Let’s first see what we have in the index file. Ohhh, we do not have an index file yet. Okay, let’s add a file using the add command (git add file1.txt), and then print the content of the index file.

❯ cat .git/index
DIRCg4T2o�g4T2o@!����CK)wZ��	file1.txt�'q��p�=%

We are all too familiar with this type of output by now, so let’s get its hexdump.

❯ hexdump -C .git/index
00000000  44 49 52 43 00 00 00 02  00 00 00 01 67 fa 34 54  |DIRC........g.4T|
00000010  32 6f c6 f6 67 fa 34 54  32 6f c6 f6 00 01 03 08  |2o..g.4T2o......|
00000020  00 40 21 16 00 00 81 a4  00 00 03 e8 00 00 03 e8  |.@!.............|
00000030  00 00 00 00 e6 9d e2 9b  b2 d1 d6 43 4b 8b 29 ae  |...........CK.).|
00000040  77 5a d8 c2 e4 8c 53 91  00 09 66 69 6c 65 31 2e  |wZ....S...file1.|
00000050  74 78 74 00 d3 cf a5 27  1a 71 e0 52 fe c9 15 f9  |txt....'.q.R....|
00000060  a9 9c 70 1e f1 af c3 3d                           |..p....=|
00000068

First we are greeted with the 12-byte header of the file. The first 4 bytes hold the string DIRC (44 49 52 43). It tells git, or anyone inspecting the file, that this is the Directory Cache (git's terminology), that is, the list of files in the staging area, or you could say the files ready to be committed.

The string is followed by a 4-byte number (00 00 00 02), 2, which is the version of the git index format. After it comes another 4-byte number (00 00 00 01), 1, which is the number of entries in the .git/index file, i.e. in the staging area. We only have 1 file in the staging area, so this checks out.
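To make the header layout concrete, here is a small sketch that decodes the first 12 bytes of the hexdump above. The names indexHeader and parseIndexHeader are made up for this illustration and are not part of gitgo.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// indexHeader holds the three fields of the 12-byte index header.
// (Hypothetical type, used only for this illustration.)
type indexHeader struct {
	Signature string
	Version   uint32
	Entries   uint32
}

// parseIndexHeader decodes the first 12 bytes of an index file:
// a 4-byte signature followed by two big-endian 32-bit numbers.
func parseIndexHeader(b []byte) (indexHeader, error) {
	if len(b) < 12 {
		return indexHeader{}, errors.New("index header needs at least 12 bytes")
	}
	h := indexHeader{
		Signature: string(b[0:4]),
		Version:   binary.BigEndian.Uint32(b[4:8]),
		Entries:   binary.BigEndian.Uint32(b[8:12]),
	}
	if h.Signature != "DIRC" {
		return indexHeader{}, fmt.Errorf("unexpected signature %q", h.Signature)
	}
	return h, nil
}

func main() {
	// The first 12 bytes of the hexdump above.
	raw := []byte{0x44, 0x49, 0x52, 0x43, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x01}
	h, err := parseIndexHeader(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s version=%d entries=%d\n", h.Signature, h.Version, h.Entries)
	// prints: DIRC version=2 entries=1
}
```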

Now that we are done with the header, it is time for the entry data. An index entry consists of:

64-bit Ctime The change time of the file. 67 fa 34 54 is the ctime in seconds, and 32 6f c6 f6 is the nanosecond part.
64-bit Mtime The modification time of the file. 67 fa 34 54 is the mtime in seconds, and 32 6f c6 f6 is the nanosecond part.
32-bit Device info 00 01 03 08 identifies the device on which the file lives.
32-bit Inode number 00 40 21 16 is the inode number of the file.
32-bit File mode 00 00 81 a4, which is 100644 in octal, is the file mode.
32-bit Uid 00 00 03 e8 is the user id, 1000 in decimal.
32-bit Gid 00 00 03 e8 is the group id, 1000 in decimal.
32-bit File size 00 00 00 00 is the size of the file data; our file is empty, so it is zero.
160-bit SHA-1 object ID e6 9d e2 9b b2 d1 d6 43 4b 8b 29 ae 77 5a d8 c2 e4 8c 53 91 is the ID of the blob; we are all too familiar with this.
16-bit Flags 00 09 holds the length of the file path in its lower 12 bits; here it tells us the path is 9 bytes long.
File Path 66 69 6c 65 31 2e 74 78 74, the 9-byte file name, which translates to file1.txt.
Null byte terminator and padding After the file path comes a null byte that terminates the entry. The total length of the entry, including this null byte, must be a multiple of 8; if it is not, more null bytes are added as padding until it is.
160-bit SHA-1 hash The final 20 bytes of the file (these are not part of the entry) act as a checksum, calculated over all the preceding content of the index file.
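The arithmetic behind the padding can be checked with a few lines of Go. This is only a sketch: fixedEntrySize and paddedEntryLen are names I made up here. The fixed-width fields add up to 62 bytes (ten 4-byte integers, the 20-byte object ID and the 2-byte flags field), then comes the path, at least one null byte, and enough extra null bytes to reach a multiple of 8.

```go
package main

import "fmt"

// fixedEntrySize is the size of the fixed-width fields of one entry:
// 10 four-byte integers (40) + 20-byte object ID + 2-byte flags.
const fixedEntrySize = 62

// paddedEntryLen returns the on-disk size of an entry whose path is
// pathLen bytes long: fixed fields, the path, one mandatory null byte,
// rounded up to the next multiple of 8.
func paddedEntryLen(pathLen int) int {
	n := fixedEntrySize + pathLen + 1 // +1 for the null terminator
	return (n + 7) / 8 * 8
}

func main() {
	// file1.txt is 9 bytes long: 62 + 9 + 1 = 72, already a multiple of 8.
	fmt.Println(paddedEntryLen(len("file1.txt"))) // prints: 72

	// The mode bytes 00 00 81 a4 printed in octal.
	fmt.Printf("%o\n", 0x81a4) // prints: 100644
}
```

This matches the hexdump: the entry starts at offset 0x0c, right after the header, and the checksum starts at offset 0x54, which is 72 bytes later.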

You can read more about this here

One more thing, can you check if you have anything in the objects directory. I’m waiting.

….

Hmmm, we have a blob object there, don’t we? It matches the object ID that we saw in the index file. Well yes, git stores the blob in the objects directory as soon as we add a file to the staging area. Maybe to make committing fast (I don’t know, I just want to copy it).
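If you want to convince yourself that the object ID in the index really is the blob ID of our empty file, here is a tiny sketch. The blobID helper is hypothetical and separate from gitgo's Blob type; it only relies on the blob format, which is the SHA-1 of "blob <size>" plus a null byte followed by the content.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// blobID returns the hex object ID git assigns to a blob with the given
// content: the SHA-1 of "blob <size>\x00" followed by the content itself.
func blobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// file1.txt is empty, so we hash an empty blob.
	fmt.Println(blobID(nil))
	// prints: e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
}
```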

Now that we know what we want in our index file, let’s start working on the extension. At first we are only going to update our code so that we can add one file at a time to the staging area.

cmd/gitgo/main.go

func main() {
    // ....
    switch os.Args[1] {
    // ....

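    case "add":
        // os.Args[1] is the "add" sub-command itself; the file path follows it.
        if len(os.Args) < 3 {
            fmt.Fprintln(os.Stderr, "Error: add needs a file path")
            os.Exit(1)
        }
        err := cmdAddHandler(os.Args[2])
        if err != nil {
            fmt.Fprintf(os.Stderr, "Error: %s\n", err)
            os.Exit(1)
        }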
    }

    // ...
}

cmd/gitgo/cmdHandler.go

func cmdAddHandler(args string) error {
    path := args
    data, err := os.ReadFile(path)
    if err != nil {
        return err
    }
    stat, err := os.Stat(path)
    if err != nil {
        return err
    }

    blob := gitgo.Blob{Data: data}.Init()
    hash, err := blob.Store()
    if err != nil {
        return err
    }

    index := gitgo.NewIndex()
    index.Add(path, hash, stat)
    // the boolean result is not needed yet, so it is discarded
    _, err = index.WriteUpdate()
    if err != nil {
        return err
    }
    return nil
}

With this we have the command, but it does not work yet. I have created a new file named entries.go in the gitgo package and moved the Entries struct and every func related to it into this file.

entries.go

package gitgo

import (
    "os"
    "syscall"
)

const (
    maxPathSize = 0xfff
    regularMode = 0100644
    executableMode = 0100755
)

type IndexEntry struct {
    Path string
    Oid string
    Mtime int64
    MtimeNsec int64
    Ctime int64
    CtimeNsec int64
    Dev uint64
    Ino uint64
    Mode int
    Uid uint32
    Gid uint32
    Size int64
    Flags uint32
}

func NewIndexEntry(name, oid string, stat os.FileInfo) *IndexEntry {
    s := stat.Sys().(*syscall.Stat_t)
    flags := min(len(name), maxPathSize)
    var m int
    if stat.Mode()&0111 != 0 {
        m = executableMode
    } else {
        m = regularMode
    }

    return &IndexEntry{
        Path: name,
        Oid: oid,
        Mtime: s.Mtim.Sec,
        MtimeNsec: s.Mtim.Nsec,
        Ctime: s.Ctim.Sec,
        CtimeNsec: s.Ctim.Nsec,
        Dev: s.Dev,
        Ino: s.Ino,
        Mode: m,
        Uid: s.Uid,
        Gid: s.Gid,
        Size: s.Size,
        Flags: uint32(flags),
    }
}

We have collected all the metadata of the file and stored it in the IndexEntry struct.

index.go

package gitgo

import (
    "bytes"
    "crypto/sha1"
    "encoding/binary"
    "encoding/hex"
    "fmt"
    "os"
    "path/filepath"
)

type Index struct {
    // for storing multiple file data, for future progress
    // i guess?
    entries map[string]IndexEntry
    // for evading the race condition like we did with the
    // head file writing
    lockfile *lockFile
}

// Initializing the Index struct with the lockfile for the
// `.gitgo/index` file.
func NewIndex() *Index {
    return &Index{
        entries: make(map[string]IndexEntry),
        lockfile: lockInitialize(filepath.Join(GITPATH, "index")),
    }
}

// Adding the file to the index
func (i *Index) Add(path, oid string, stat os.FileInfo) {
    entry := NewIndexEntry(path, oid, stat)
    i.entries[path] = *entry
}

func (i *Index) WriteUpdate() (bool, error) {
    // getting the lock on the index file
    b, err := i.lockfile.holdForUpdate()
    if err != nil {
        return false, err
    }
    if !b {
        return false, nil
    }

    // make a new buffer and return its pointer
    buf := new(bytes.Buffer)
    writeHeader(buf, len(i.entries))
    for _, entry := range i.entries {
        b, err := writeIndexEntry(entry)
        if err != nil {
            return true, err
        }
        buf.Write(b)
    }

    // getting the hash of the whole content in the
    // index file
    content := buf.Bytes()
    bufHash := sha1.Sum(content)
    buf.Write(bufHash[:])

    i.lockfile.write(buf.Bytes())
    i.lockfile.commit()
    return true, nil
}

func writeHeader(buf *bytes.Buffer, entryLen int) error {
    _, err := buf.Write([]byte("DIRC"))
    if err != nil {
        return fmt.Errorf("writing index header: %s", err)
    }

    b := new(bytes.Buffer)
    versionNum := uint32(2)
    entriesNum := uint32(entryLen)
    binary.Write(b, binary.BigEndian, versionNum)
    binary.Write(b, binary.BigEndian, entriesNum)

    _, err = buf.Write(b.Bytes())
    if err != nil {
        return fmt.Errorf("writing index header: %s", err)
    }
    return nil
}

func writeIndexEntry(entry IndexEntry) ([]byte, error) {
	b := new(bytes.Buffer)
	err := binary.Write(b, binary.BigEndian, uint32(entry.Ctime))
	if err != nil {
		return nil, fmt.Errorf("writing ctime: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.CtimeNsec))
	if err != nil {
		return nil, fmt.Errorf("writing ctime nsec: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Mtime))
	if err != nil {
		return nil, fmt.Errorf("writing mtime: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.MtimeNsec))
	if err != nil {
		return nil, fmt.Errorf("writing mtime nsec: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Dev))
	if err != nil {
		return nil, fmt.Errorf("writing dev: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Ino))
	if err != nil {
		return nil, fmt.Errorf("writing ino: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Mode))
	if err != nil {
		return nil, fmt.Errorf("writing mode: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Uid))
	if err != nil {
		return nil, fmt.Errorf("writing uid: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Gid))
	if err != nil {
		return nil, fmt.Errorf("writing gid: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint32(entry.Size))
	if err != nil {
		return nil, fmt.Errorf("writing size: %s", err)
	}
	oid, err := hex.DecodeString(entry.Oid)
	if err != nil {
		return nil, fmt.Errorf("decoding string oid: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, oid)
	if err != nil {
		return nil, fmt.Errorf("writing oid: %s", err)
	}
	err = binary.Write(b, binary.BigEndian, uint16(entry.Flags))
	if err != nil {
		return nil, fmt.Errorf("writing flag: %s", err)
	}
	_, err = b.Write([]byte(entry.Path))
	if err != nil {
		return nil, fmt.Errorf("writing entry path: %s", err)
	}
	err = b.WriteByte(0)
	if err != nil {
		return nil, fmt.Errorf("writing null byte after entry: %s", err)
	}
	missing := (8 - (b.Len() % 8)) % 8
	for range missing {
		b.WriteByte(0)
	}
	return b.Bytes(), nil
}

Yup, the function writeIndexEntry is 80% error checking and returning the proper error. Well, you gotta choose your poison, man: either you use try-except, as in other languages, or this… I like this, what about you?
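If the wall of error checks bothers you, one possible alternative (not what gitgo does) is to put the fixed-width fields in a struct and let binary.Write encode the whole struct in one call. The sketch below assumes a hypothetical entryFixed struct; the path, null terminator and padding would still have to be written separately.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// entryFixed is a hypothetical struct mirroring the fixed-width part of an
// index entry. Every field has a fixed size, so binary.Write can encode it
// field by field, in declaration order, without any padding.
type entryFixed struct {
	CtimeSec, CtimeNsec uint32
	MtimeSec, MtimeNsec uint32
	Dev, Ino, Mode      uint32
	Uid, Gid, Size      uint32
	Oid                 [20]byte
	Flags               uint16
}

// encodeFixed writes all fixed-width fields in big-endian order in one call,
// so there is a single error to check instead of one per field.
func encodeFixed(e entryFixed) ([]byte, error) {
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.BigEndian, e); err != nil {
		return nil, fmt.Errorf("writing fixed entry fields: %w", err)
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encodeFixed(entryFixed{Mode: 0100644, Uid: 1000, Gid: 1000, Flags: 9})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(b))           // prints: 62
	fmt.Printf("% x\n", b[24:28]) // prints: 00 00 81 a4
}
```

The trade-off is that a single error no longer tells you which field failed, which the field-by-field version does.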

Well, you probably get the gist: we create a struct for the index data, and we use the lockFile to prevent race conditions, as we did when writing the HEAD file.

First we obtain the lock and create a buffer. We write the header to the buffer, then iterate over the map of entries and write all of their metadata to the buffer. After writing all the content, we compute the SHA-1 hash of everything in the buffer and append it. Lastly, we commit the data to the index file through the lockfile, which writes all the data to the file and releases the lock.
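The checksum step is easy to try in isolation. The sketch below is not gitgo code: appendChecksum and verifyChecksum are hypothetical helpers that show how the trailing 20 bytes relate to everything before them, and how a reader of the file could detect corruption.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"fmt"
)

// appendChecksum returns a copy of body followed by the SHA-1 of body,
// which is how the index file ends.
func appendChecksum(body []byte) []byte {
	sum := sha1.Sum(body)
	return append(append([]byte{}, body...), sum[:]...)
}

// verifyChecksum reports whether the final 20 bytes of index equal the
// SHA-1 of everything that precedes them.
func verifyChecksum(index []byte) bool {
	if len(index) < sha1.Size {
		return false
	}
	body := index[:len(index)-sha1.Size]
	trailer := index[len(index)-sha1.Size:]
	sum := sha1.Sum(body)
	return bytes.Equal(sum[:], trailer)
}

func main() {
	// A header-only index (version 2, zero entries) plus its checksum.
	idx := appendChecksum([]byte("DIRC\x00\x00\x00\x02\x00\x00\x00\x00"))
	fmt.Println(verifyChecksum(idx)) // prints: true

	idx[5] ^= 0xff // flip some bits in the body
	fmt.Println(verifyChecksum(idx)) // prints: false
}
```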

Now why don’t we test our code? Let’s create a new empty file, hello.txt, and add it with both git and gitgo.

git add hello.txt; gitgo add hello.txt

Now compare the hexdump of the .git/index & .gitgo/index.

❯ hexdump -C .git/index; hexdump -C .gitgo/index
00000000  44 49 52 43 00 00 00 02  00 00 00 01 67 f2 5f 12  |DIRC........g._.|
00000010  1e 96 2a 83 67 f2 5f 12  1e 96 2a 83 00 01 03 08  |..*.g._...*.....|
00000020  00 34 1a 1d 00 00 81 a4  00 00 03 e8 00 00 03 e8  |.4..............|
00000030  00 00 00 00 e6 9d e2 9b  b2 d1 d6 43 4b 8b 29 ae  |...........CK.).|
00000040  77 5a d8 c2 e4 8c 53 91  00 09 68 65 6c 6c 6f 2e  |wZ....S...hello.|
00000050  74 78 74 00 01 e7 73 59  af ad 1b a7 93 c4 f2 66  |txt...sY.......f|
00000060  2b 50 30 1b e9 53 c6 de                           |+P0..S..|
00000068

00000000  44 49 52 43 00 00 00 02  00 00 00 01 67 f2 5f 12  |DIRC........g._.|
00000010  1e 96 2a 83 67 f2 5f 12  1e 96 2a 83 00 01 03 08  |..*.g._...*.....|
00000020  00 34 1a 1d 00 00 81 a4  00 00 03 e8 00 00 03 e8  |.4..............|
00000030  00 00 00 00 e6 9d e2 9b  b2 d1 d6 43 4b 8b 29 ae  |...........CK.).|
00000040  77 5a d8 c2 e4 8c 53 91  00 09 68 65 6c 6c 6f 2e  |wZ....S...hello.|
00000050  74 78 74 00 01 e7 73 59  af ad 1b a7 93 c4 f2 66  |txt...sY.......f|
00000060  2b 50 30 1b e9 53 c6 de                           |+P0..S..|
00000068

Adding multiple files in `index`

For now, let’s update our add handler so that it can add multiple files given to it as an argument list; later we can make it update the index file incrementally, as the user wants.

In our current implementation the add handler takes only a single file as input, and we want to extend it to accept multiple files. The important thing is that git stores the index entries sorted by path. So we want a way to add multiple files and keep them sorted at the same time. For this purpose we can use a Sorted Set data structure. Go does not have a built-in Sorted Set, so we will have to create it ourselves. I have implemented the Sorted Set in this blog, so I won’t be going over it here.
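If you just want to see the ordering without building a Sorted Set, a rough stand-in (not the datastr.SortedSet used in this series) is to collect the map keys and sort them with the standard library. The sortedPaths helper is a made-up name, and the sketch assumes plain byte-wise ordering of paths, which is what sort.Strings gives us.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedPaths returns the keys of entries in ascending byte-wise order.
// It re-sorts on every call, unlike a sorted set that keeps its order
// as keys are inserted.
func sortedPaths(entries map[string]struct{}) []string {
	paths := make([]string, 0, len(entries))
	for p := range entries {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	return paths
}

func main() {
	entries := map[string]struct{}{
		"b.txt":   {},
		"a/z.txt": {},
		"a.txt":   {},
	}
	fmt.Println(sortedPaths(entries))
	// prints: [a.txt a/z.txt b.txt]
}
```

Note that a.txt sorts before a/z.txt because the byte for '.' (0x2e) is smaller than the byte for '/' (0x2f).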

Let’s update our structure.

index.go

  type Index struct {
  	entries  map[string]IndexEntry
+	keys     *datastr.SortedSet
  	lockfile *lockFile
  }

  func NewIndex() *Index {
  	return &Index{
  		entries:  make(map[string]IndexEntry),
+       keys:     datastr.NewSortedSet(),
  		lockfile: lockInitialize(filepath.Join(GITPATH, "index")),
  	}
  }

  func (i *Index) Add(path, oid string, stat os.FileInfo) {
  	entry := NewIndexEntry(path, oid, stat)
 +	i.keys.Add(path)
  	i.entries[path] = *entry
  }

  func (i *Index) WriteUpdate() (bool, error) {
  	b, err := i.lockfile.holdForUpdate()
  	if err != nil {
  		return false, err
  	}
  	if !b {
  		return false, nil
  	}

  	buf := new(bytes.Buffer) // makes a new buffer and returns its pointer
  	writeHeader(buf, len(i.entries))
 +	it := i.keys.Iterator()
 +	for it.Next() {
 +		path := it.Key()
 +		entry := i.entries[path]
 +		data, err := writeIndexEntry(entry)
 +		if err != nil {
 +			return true, err
 +		}
 +		buf.Write(data)
  	}

  	// getting the hash of the whole content in the
  	// index file
  	content := buf.Bytes()
  	bufHash := sha1.Sum(content)
  	buf.Write(bufHash[:])

  	i.lockfile.write(buf.Bytes())
  	i.lockfile.commit()
  	return true, nil
  }

We only change a little bit of code in index.go to accommodate the new requirement. Each file given to us is also added to the sorted set, which keeps the paths in sorted order (our implementation takes care of that), and then we loop over the set and write the data to the index file.

Now we have to update our add handler accordingly, so that it stores multiple files when they are provided.

cmd/gitgo/cmdHandler.go

func cmdAddHandler(args []string) error {
	index := gitgo.NewIndex()
	for _, path := range args {
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		stat, err := os.Stat(path)
		if err != nil {
			return err
		}

		blob := gitgo.Blob{Data: data}.Init()
		hash, err := blob.Store()
		if err != nil {
			return err
		}

		index.Add(path, hash, stat)
	}
	res, err := index.WriteUpdate()
	if err != nil {
		return err
	}

	if res {
		fmt.Println("Written data to Index file")
	}

	return nil
}

With this we are done with the first extension of the add handler: storing multiple file paths when they are provided.

Storing the given folder

Now we want to update our code so that if the user gives us a folder, we expand it into all the files present inside that folder. That way, if the user runs gitgo add ., we can add all the files inside the root directory, the one that contains the .gitgo folder.

For this we will have to update our ListFiles function in the files.go file.

files.go

func ListFiles(dir string) ([]string, error) {
    var workfiles []string

    err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }

        // check if the given dir path is file or directory?
+       s, err := os.Stat(path)
+       if err != nil {
+           return err
+       }
+       if !s.IsDir() {
+           relPath, err := filepath.Rel(dir, path)
+           if err != nil {
+               return err
+           }
+           if relPath == "." {
+               relPath, err = filepath.Rel(ROOTPATH, path)
+               if err != nil {
+                   return err
+               }
+               workfiles = append(workfiles, relPath)
+               return nil
+           }
+       }
        name := filepath.Base(path)
        // skip the files or directories found in the ignore hashmap
        if _, found := g_ignore[name]; found {
            if d.IsDir() {
                return filepath.SkipDir
            }
            return nil
        }

        // Append only files, not directories
        if !d.IsDir() {
+           relPath, err := filepath.Rel(ROOTPATH, path)
+           if err != nil {
+               return err
+           }
            workfiles = append(workfiles, relPath)
        }
        return nil
    })
    if err != nil {
        return nil, err
    }
    return workfiles, nil
}
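The walk-and-filter idea can also be tried on its own, outside gitgo. The sketch below is a simplified stand-in for ListFiles: listRelFiles and makeDemoTree are hypothetical names, the ignore set is passed in instead of using g_ignore, and paths are made relative to the walked folder rather than to ROOTPATH.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listRelFiles walks root and returns every non-directory entry as a path
// relative to root, skipping anything whose base name is in ignore.
func listRelFiles(root string, ignore map[string]bool) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if ignore[d.Name()] {
			if d.IsDir() {
				return filepath.SkipDir // do not descend into ignored folders
			}
			return nil
		}
		if d.IsDir() {
			return nil
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		files = append(files, filepath.ToSlash(rel))
		return nil
	})
	return files, err
}

// makeDemoTree creates a throwaway folder with a few empty files and
// returns its path together with a cleanup function.
func makeDemoTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "walk-demo")
	if err != nil {
		return "", func() {}, err
	}
	cleanup = func() { os.RemoveAll(root) }
	for _, p := range []string{"a.txt", "sub/b.txt", ".gitgo/index"} {
		full := filepath.Join(root, filepath.FromSlash(p))
		if err = os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			return root, cleanup, err
		}
		if err = os.WriteFile(full, nil, 0o644); err != nil {
			return root, cleanup, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := makeDemoTree()
	defer cleanup()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	files, err := listRelFiles(root, map[string]bool{".gitgo": true})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(files) // prints: [a.txt sub/b.txt]
}
```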

Now let’s update the cmdHandler to reflect the changes we made in our add functionality.

cmd/gitgo/cmdHandler.go

func cmdAddHandler(args []string) error {
    index := gitgo.NewIndex()
    for _, path := range args {
+       absPath, err := filepath.Abs(path)
+       if err != nil {
+           return err
+       }
+
+       expandPaths, err := gitgo.ListFiles(absPath)
+       if err != nil {
+           return err
+       }
+
+       for _, p := range expandPaths {
+           ap, err := filepath.Abs(p)
+           if err != nil {
+               return err
+           }

+           data, err := os.ReadFile(ap)
            if err != nil {
                return err
            }
+           stat, err := os.Stat(ap)
            if err != nil {
                return err
            }

            blob := gitgo.Blob{Data: data}.Init()
            hash, err := blob.Store()
            if err != nil {
                return err
            }

+           index.Add(p, hash, stat)
        }
    }
    res, err := index.WriteUpdate()
    if err != nil {
        return err
    }

    if res {
        fmt.Println("Written data to Index file")
    }
    return nil
}

And now we are done with our basic implementation of the add handler. We can add multiple files given as arguments, or a folder given as an argument.

Afterwords

We have added the add command, which puts the selected files in the staging area (aka git’s index file).
We are able to add multiple files given as arguments.
We are able to add all the files inside a folder path given as an argument.

In the next part we will be working on updating the add command so that it can add files to the staging area incrementally.

Code Link: Github

Just know this,

Reinvent the wheel, so that you can learn how to invent wheel

– a nobody
