Building Git: Part III
Yokosho watashino, part 3 of Building git. In the previous parts we build the gitgo
capable of storing the flat
directory structure and the files inside it. But having only single commit in the version control system, is not very
useful. We have no relationships between the commits as of now, and it is not helpfull to have all the commits
disoriented, with no way of going back to parent commit. Our first goal was to just commit the contents to get started
with the miniature version of the git
.
Now, we will be working to make history in this part. Let’s get to it.
The parent
field
The git
adds the new parent
field in the commit object
on the every subsequent commits to make a history, and to be able to trace all the history. Only the root-commit object
does not have the parent
field in the commit object, but every following commits contains a parent
field telling git
where this commit
inherits from, for the case of merge
the commit has two parents. The git
uses this to sort all the
commit
and make a tree structure. Git do not uses the timestamp for sorting all the commits as it will be harder,
and not feasable when working with the team, as two people can make a commit
at same time, so timestamp it not a
good parameter to sort the commits. For this we use the parent
field, if a user wants the latest 5 commits, we can
easily show the user all the commits by getting the latest commit from the .git/HEAD
and then tracing the parent commit
of all the commit
till we have 5 commits
.
It is more robust to put the idea that commit B is derived from commit A into data model; it makes calculating what
changed on different people’s branches, and merging those changes together. We will see more on these when we will building
the merge
and branch
commands.
Differences between trees
Let’s get the tree data of the two commits on the test file we made.
❯ git cat-file -p 88e3870
100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt
This is the output of the first commit tree structure. Now let’s get the commit tree structure of the second commit
❯ git cat-file -p 040c6f3e
100644 blob e019be006cf33489e2d0177a3837a2384eddebc5 hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt
As we can see that the hash of the file hello.txt
is different in the both tree object, because the first commits
hello.txt holds the value hello
and the second commits hello.txt holds the value second
, so it has the different
hash in the both commits tree structure.
As for the case of file world.txt
its content remained the same in the both commits, so hash was not different, and
git does not have to store two different compressed files storing the same file content, so git do not save the additional
file for same content across the different commits, it just add the hash of the file, this is how git saves the
storage space for the codebase with millions of line of code and thousands of file.
Implementing the parent
chain
So we do not have to modify our code of storing the blob
and tree
object, and keep updating the .git/HEAD
to
point to the latest commit. We have to add the parent
hash to the new commits, we can know if the commit is root-commit
or not by checking for the .git/HEAD
file, if the file exists then the commit is not a root-commit and add the
.git/HEAD
files hash as the parent hash in the commit object, if we do not find the file, then the commit is a
root-commit and we do not have to add a parent hash to the commit object.
We will introduce an abstraction to updating the HEAD
file to update the HEAD
file easily and safely.
refs.go
package gitgo
import (
"os"
"path/filepath"
)
type ref struct {
pathname string
headPath string
}
func RefInitialize(pathname string) ref {
r := ref{pathname: pathname}
r.headPath = filepath.Join(r.pathname, "HEAD")
return r
}
func (r ref) UpdateHead(oid []byte) error {
flags := os.O_WRONLY|os.O_CREATE
refFile, err := os.OpenFile(r.headPath, flags, 0644)
if err != nil {
return err
}
defer refFile.Close()
_, err = refFile.Write(oid)
if err != nil {
return err
}
return nil
}
Let’s add a function that will return the hash string that the HEAD
file is storing if the file exist, else the
function will return the empty string if the file do not exists, this will tell that the file do not exists, and this
is root-commit.
func (r ref) ReadHead() string {
_, err := os.Stat(r.headPath)
if os.IsNotExist(err) {
return ""
}
content, _ := os.ReadFile(r.headPath)
return string(content)
}
Now we can use this in the commit
handler.
cmd/gitgo/cmdHandler.go
refs := gitgo.RefInitialize(gitgo.GITPATH)
Add this line after storing all the blob and tree object.
author := gitgo.Author{
Name: name,
Email: email,
Timestamp: time.Now(),
}.New()
message := gitgo.ReadStdinMsg()
refs := gitgo.RefInitialize(gitgo.GITPATH)
parent := refs.ReadHead()
is_root := ""
if parent == "" {
is_root = "(root-commit) "
}
commitData := gitgo.Commit{
Parent: parent, // we will add this field to struct
TreeOID: treeHash,
Author: author,
Message: message,
}.New()
cHash, err := gitgo.StoreCommitObject(commitData)
if err != nil {
return err
}
refs.UpdateHead([]byte(cHash))
fmt.Printf("%s %s %s", is_root, cHash, message)
return nil
Remove the code where we are creating the HEAD
file in the Handler. Now we need to adjust the Commit
struct to
add the Parent
hash string.
database.go
type Commit struct {
Parent string
TreeOID string
Author string
Message string
}
And now update the Commit
structs New()
function.
func (c Commit) New() string {
lines := []string{}
lines = append(lines, fmt.Sprintf("tree %s", c.TreeOID))
if c.Parent != "" {
lines = append(lines, fmt.Sprintf("parent %s", c.Parent))
}
lines = append(lines, fmt.Sprintf("author %s", c.Author))
lines = append(lines, fmt.Sprintf("committer %s", c.Author))
lines = append(lines, "")
lines = append(lines, c.Message)
return strings.Join(lines, "\n")
}
With these changes in place we can now build the executable file from the cmd/gitgo
, and run the gitgo
to
create a new commit. We can verify that the new commit has the parent field in it.
Rename .gitgo
folder to .git
, and now use the git’s command hehehe…
❯ git cat-file -p 51db7c69
tree 040c6f3e807f0d433870584bc91e06b6046b955d
parent 706f942675c04ef7cd637022b6ec236b2a98e6da
author <> 1740759443 +0530
comitter <> 1740759443 +0530
Second Commit
So our new commit has the parent field in it. Great.
Safely updating .git/HEAD
When we were creating our code to store the blob
object we where first writing to a temp file and then renaming
it to the permanent file name, because there might be a case where we are writing to a file, and at the same time
another program is reading that file, so the reading program will see the uncomplete content written to file.
So for it not to happen we first write to a temp file and when the writing is complete we rename the file to the
permanent file name.
In case of .git/HEAD
this means that while we are updating the file, other program might try to read it, and it
might see an empty string or a partial hash of the commit.
Files in .git/objects
never change; because the names of the object files are determined by the files content, under
the assumption that SHA1 is doing its job, so for us it does not matter that which of the two different program wins
because they are writing the same content. All we care that this process appears to be atomic and when we read the file
they have the full content in them. For this purpose it’s sufficient to pick a random filename to write the data out to,
and then rename this file.
This is not the case for .git/HEAD
and other refernces - their while purpose is to have stable, predictable names
and change their content over time, enabling us to find the latest commit without the need to know its ID. Writes to
them must still appear atomic, but we must not preassume that two process are trying to write the same thing to the file.
In fact we should preassume that both are trying to write different references to the file is an error, because unless
those process are explicity co-ordinating with each other, they will probably disagree about the state of the system
and the final value of reference will depend on whichever process finish last.
Such race condition become even more important when data must be read from a file, then modified or transformed in some way before being written back to the file. Or, when we need to check the file’s current value before deciding whether to overwrite it - we need to know that the value won’t be changed by another process while we’re making this decision.
So what can we do? We can solve this problem by introducting a new abstraction called a Lockfile
. This will be
initialised with the path of the file we want to change, and it will attemp to open a file to write to by appending .lock
to the original pathname. We need to pick a well-know
name rather than generating a random pathname, because the whole point is to prevent two processes from getting access
to the same resource at once, and this is easier if they’re both trying to access the same file.
Let’s begin writing.
lockfile.go
package gitgo
import (
"errors"
"log"
"os"
"sync"
)
var (
ErrMissingParent = errors.New("Missing Parent")
ErrNoPermission = errors.New("No Permission")
ErrStaleLock = errors.New("Stale Lock")
)
type lockFile struct {
FilePath string
LockPath string
Lock *os.File
mu sync.Mutex
}
func lockInitialize(path string) *lockFile {
lockpath := path + ".lock"
return &lockFile{
FilePath: path,
LockPath: lockpath,
}
}
func (l *lockFile) holdForUpdate() (bool, error) {
l.mu.Lock()
defer l.mu.Unlock()
if l.Lock != nil {
return true, nil // lock already aquired
}
file, err := os.OpenFile(l.LockPath, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0644)
if err != nil {
if os.IsExist(err) {
return false, nil
} else if os.IsNotExist(err) {
return false, ErrMissingParent
} else if os.IsPermission(err) {
return false, ErrNoPermission
}
return false, err
}
l.lock = file
return true, nil
}
func (l *lockFile) write(data []byte) {
l.mu.Lock()
defer l.mu.Unlock()
l.errOnStaleLock()
_, err := l.Lock.Write(data)
if err != nil {
log.Fatalf("Write error: %s\n", err)
}
}
func (l *lockFile) commit() {
l.mu.Lock()
defer l.mu.Unlock()
l.errOnStaleLock()
err := l.Lock.Close()
if err != nil {
log.Fatalf("Err closing file: %s\n", err)
}
err = os.Rename(l.LockPath, l.FilePath)
if err != nil {
log.Fatalf("Err renaming file: %s\n", err)
}
l.Lock = nil
}
func (l *lockFile) errOnStaleLock() {
if l.Lock == nil {
log.Fatalf("Err: %s\nNot holding lock of file: %s\n", ErrStaleLock, l.LockPath)
}
}
The purpose of the holdForUpdate
function is to let the caller attemp to acquire a lock for writing to the file,
and to be told whether they were successful. This is done by attempting to open the .lock
file with the CREATE
and
EXCL
flags, which means the file will be created if it does not exist, and an error will result if it already exists.
If the OpenFile
call succeeds, then we store the file handle in the lockFile.Lock
and return true
. If the file
already exists, we return the error.
After aquiring the lock we need two further methods write
and commit
. The write method builds up data to be written
to the original file, by writing to the file with .lock
. Then after the writing will be done, the commit
function
will rename the file to the original file name, and will update the lockFile.Lock
to null, so that no more data can
be written. Both of these function will stop when lockFile.Lock
does not exists, since that indicates either that the
lock has been released, or that the caller never acquired it in the first place.
Now we can use this in the ref.UpdateHead()
function, to prevent the two instances of gitgo
to move the
.gitgo/HEAD
at the same time.
refs.go
import (
// ...
"errors"
"fmt"
)
var ErrLockDenied = errors.New("Lock Denied")
// ...
func (r ref) UpdateHead(oid []byte) error {
lockfile := lockInitialize(r.headPath)
if lock, _ := lockfile.holdForUpdate(); !lock {
return fmt.Errorf("Err: %s\nCould not aquire lock on file: %s", ErrLockDenied, r.headPath)
}
oid = append(oid, '\n')
lockfile.write(oid)
lockfile.commit()
return nil
}
This modification to refs.go
ensures that the chain of commits is properly recorded, since it prevents .gitgo/HEAD
from being inconsistenlty updated. The lockfile
abstraction will find further use later, as we introduce more types
of record-keeping into the repository.
Don’t overwrite objects
Before move to the next feature, there’s a small improvement we need to make. When we first wrote the cmdHandler
we were writing all the objects whether commit
, tree
or commit
object to disk.
Now on the every subsequent commit, there will be files that will be unchanged, particularly blob
object. It’s
wasteful to re-write these files, so we’d like to avoid
it if possible.
database.go
func StoreObject(data bytes.Buffer,
prefix, folderPath, PermPath string
) error {
err := os.MkdirAll(folderPath, 0755)
if err != nil {
return err
}
// if the file exists exit
_, err = os.Stat(PermPath)
if os.IsExist(err) {
return nil
}
// ...
}
Now we are skipping the files that already exists on the disk.
Afterwords
And voila… there we have it. In this part we created
-
A link to the past, meaning storing the previous commits hash to the subsequent commits as the
parent
field. -
Safely updating the
.gitgo/HEAD
file. -
Not overwriting the
blob
object to the disk.
Great work…
Code Link: Github
Just know this,
Reinvent the wheel, so that you can learn how to invent wheel
-– a nobody