CS61B(12): Command Line Programming and Git

⛱️ These are the lecture notes for CS61B, Lecture 12.

In this lecture, we do a theoretical warm-up for Project 2: Gitlet.

Git is a sophisticated piece of software that relies on many ideas we have not yet covered:

  • Maps
  • Hashing
  • File I/O
  • Graphs

How Git Works

Every time you commit changes, Git stores a copy of the entire repository in a hidden folder on your computer called .git.

You may wonder, though: copying the entire repo on every commit, much like copying a whole folder every day, seems to create a lot of redundancy. So an important question is how to avoid that redundancy.

In the rest of this lecture, we will discuss various tricks employed to avoid redundancy, and find the best one.

Avoiding Redundancy

Approach 1

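The first approach is the naive one described above: store a complete copy of the repository on every commit. This is very inefficient, because files that did not change are stored again and again.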

Approach 2

In this revised approach 2, we only store files that change.

  • Much more efficient. Avoids storing redundant files.
  • However, checkout is now more complicated: to check out a commit, we have to copy files from a variety of different folders.
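
To make the checkout cost concrete, here is a minimal Java sketch of Approach 2. It assumes each commit is modeled as a map from file name to contents holding only the files that changed in that commit. The class name `ChangedFilesCheckout`, the method `checkout`, and the sample file names are hypothetical, and deleted files are ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of Approach 2: each commit stores only the files that changed. */
class ChangedFilesCheckout {
    /**
     * commits.get(i) maps file name -> contents for the files stored in commit i.
     * Returns the full set of files as of commit number target.
     */
    static Map<String, String> checkout(List<Map<String, String>> commits, int target) {
        Map<String, String> result = new TreeMap<>();
        // Walk backwards from the target commit. The first copy we find
        // of a file name is the newest one, so older copies are skipped.
        for (int i = target; i >= 0; i--) {
            for (Map.Entry<String, String> e : commits.get(i).entrySet()) {
                result.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> commits = new ArrayList<>();
        commits.add(Map.of("Hello.java", "v1", "Friend.java", "v1")); // commit 0
        commits.add(Map.of("Hello.java", "v2"));                      // commit 1
        commits.add(Map.of("Egg.java", "v1"));                        // commit 2
        // Checking out commit 2 pulls files from three different commits.
        System.out.println(checkout(commits, 2));
    }
}
```

Note that the loop may have to visit every earlier commit to rebuild one snapshot, which is exactly the complication mentioned above.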

Approach 3

And this approach has another advantage.

Approach 4

Although the previous approach seems fine, it still has some flaws. So we keep going, toward the approach used in the real world, that is, the approach used in Git.

It is hashing.

First, let's see some advantages of approach 3.

So, we propose a new approach.

Approach 5

Since approach 4 still has flaws, it's time to see the "real" approach used in Git.

Every file has its own git-SHA1 hash:

So how does the git-SHA1 hash work?
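
As a rough illustration, the sketch below assumes that "git-SHA1" means the SHA-1 that Git computes for a file's contents (a "blob"): the SHA-1 of the header `blob <length>` plus a zero byte, followed by the file's bytes. The class name `GitSha1Sketch` and the method `blobHash` are hypothetical. Gitlet itself may hash files differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class GitSha1Sketch {
    /** SHA-1 of "blob <length>\0" followed by the contents, as 40 hex digits. */
    static String blobHash(byte[] contents) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            // Header: object type, a space, the content length, then a zero byte.
            md.update(("blob " + contents.length + "\0").getBytes(StandardCharsets.UTF_8));
            md.update(contents);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(blobHash("hello\n".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Two files with identical contents get the same hash, so the hash can serve as a version name that never needs to be stored twice.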

Serializable and Storing Data Structures

The commit ID is the git-SHA1 hash of the commit.

  • You might object: “A commit is an object, not a file”.
  • Imagine a file containing the author, date, commit message, list of files and their versions, and parent ID, then git-SHA1 hash that.
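
The "imagine a file" idea can be sketched in Java as follows. The text layout of the imaginary file, the class name `CommitIdSketch`, and the method names are assumptions made for illustration. Real Git uses its own commit format.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class CommitIdSketch {
    /** SHA-1 of the given text, as 40 hex digits. */
    static String sha1Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Builds the imaginary "commit file" as text, then hashes it. */
    static String commitId(String author, String date, String message,
                           Map<String, String> fileVersions, String parentId) {
        StringBuilder sb = new StringBuilder();
        sb.append("author ").append(author).append('\n');
        sb.append("date ").append(date).append('\n');
        sb.append("message ").append(message).append('\n');
        // Sort the file names so the same commit always yields the same text.
        for (Map.Entry<String, String> e : new TreeMap<>(fileVersions).entrySet()) {
            sb.append("file ").append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        // A null parent (the initial commit) is written as the text "null".
        sb.append("parent ").append(parentId).append('\n');
        return sha1Hex(sb.toString());
    }
}
```

Because the parent ID is part of the hashed text, changing any earlier commit would change the IDs of all commits after it.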

After generating commit IDs, we need to store them so that they can be read later.
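
One way to store a commit so that it can be read back later is Java's built-in serialization. The sketch below assumes a tiny hypothetical `Commit` class and writes each commit to a file named after its ID. The class names, the fields, and the ID `abc123` are all illustrative.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

class CommitStoreSketch {
    /** A tiny commit object; implementing Serializable lets Java write it to a stream. */
    static class Commit implements Serializable {
        private static final long serialVersionUID = 1L;
        final String message;
        final String parentId;

        Commit(String message, String parentId) {
            this.message = message;
            this.parentId = parentId;
        }
    }

    /** Writes the commit to the file dir/id. */
    static void save(Path dir, String id, Commit c) throws IOException {
        Files.createDirectories(dir);
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(dir.resolve(id)))) {
            out.writeObject(c);
        }
    }

    /** Reads the commit back from the file dir/id. */
    static Commit load(Path dir, String id) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(dir.resolve(id)))) {
            return (Commit) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("commits");
        save(dir, "abc123", new Commit("initial commit", null));
        System.out.println(load(dir, "abc123").message);
    }
}
```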

Branching

We can (attempt to) merge branches, but there may be conflicts.

After resolving the conflict, the new commit has two parents!

Note: Commits are no longer a linked list.

  • This is a more general structure called a “graph”.
  • More on graphs later in our class.
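
A small Java sketch can show why a merge turns the history into a graph. It assumes the history is given as a map from each commit ID to the IDs of its parents. The class name `CommitGraphSketch`, the method `ancestors`, and the commit IDs are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CommitGraphSketch {
    /**
     * parents maps each commit ID to the IDs of its parents
     * (none for the initial commit, two for a merge commit).
     * Returns start together with every commit reachable from it.
     */
    static Set<String> ancestors(Map<String, List<String>> parents, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.push(start);
        while (!toVisit.isEmpty()) {
            String id = toVisit.pop();
            // With two parents, the same ancestor can be reached along
            // two paths, so we must remember what we have already seen.
            if (!seen.add(id)) {
                continue;
            }
            for (String p : parents.getOrDefault(id, List.of())) {
                toVisit.push(p);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parents = Map.of(
            "c1", List.of(),
            "c2", List.of("c1"),            // commit on one branch
            "c3", List.of("c1"),            // commit on another branch
            "merge", List.of("c2", "c3"));  // merge commit with two parents
        System.out.println(ancestors(parents, "merge").size());
    }
}
```

With a linked list, following a single parent pointer would be enough. Here `c1` is reachable through both `c2` and `c3`, which is why the traversal needs a visited set.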