I Built PyVCS to Understand How Git Works Internally

Why I Did It

Like many developers, I use Git every day. I know the commands by heart – add, commit, push, log – but I realised I had only a vague idea of what actually happens under the hood. What’s a blob? How does a tree differ from a commit? How does the index (staging area) work? And most mysteriously, how does git push actually send data to a remote server?

I wanted to demystify this. Instead of reading dry documentation, I decided to build my own simplified Git clone. Not to replace Git, but to understand it.

So I wrote PyVCS – a pure-Python version control system that implements a subset of Git’s features. The entire codebase is around 500 lines, and it taught me more than any tutorial could.

What I Set Out to Build

My goals were clear:

Object storage – store blobs, trees, and commits with SHA‑1 hashing and zlib compression.
A staging area (index) – the classic add and commit flow.
Commit history – parent pointers, log, and the ability to walk the chain.
Working tree operations – status and diff.
Remote push – speak the Git pack protocol over HTTP.

I also wanted it to be installable – so that after I was done, I could actually use it as a command-line tool, even if it’s limited.

Step 1: Understanding the Object Model

Git’s object model is surprisingly elegant. There are four object types, of which I implemented three: blobs (file contents), trees (directory listings), and commits (snapshots with metadata). Each object is a header (the type, a space, the size in bytes, then a NUL byte) followed by the data. The SHA‑1 of those uncompressed bytes becomes the object’s name, and the bytes are zlib‑compressed before they are written to disk.

The first thing I wrote was hash_object() – it computes the hash, writes the compressed data to .vcs/objects/ab/cdef... (the first two hex characters of the hash name the directory, the rest name the file), and returns the hash. I also wrote read_object() to decompress and parse the object back.
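To make the format concrete, here is a minimal sketch of what such a pair of functions could look like. It is not the PyVCS source: the signatures, the repo_dir parameter, and the default arguments are assumptions made for illustration.

```python
import hashlib
import os
import zlib


def hash_object(data: bytes, obj_type: str = "blob",
                repo_dir: str = ".vcs", write: bool = True) -> str:
    """Hash data as a Git-style object and optionally store it on disk."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    full = header + data
    # The hash covers the uncompressed header + data.
    sha1 = hashlib.sha1(full).hexdigest()
    if write:
        path = os.path.join(repo_dir, "objects", sha1[:2], sha1[2:])
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(zlib.compress(full))
    return sha1


def read_object(sha1: str, repo_dir: str = ".vcs") -> tuple[str, bytes]:
    """Return (type, data) for a stored object."""
    path = os.path.join(repo_dir, "objects", sha1[:2], sha1[2:])
    with open(path, "rb") as f:
        full = zlib.decompress(f.read())
    header, _, data = full.partition(b"\x00")
    obj_type, size = header.decode().split()
    if int(size) != len(data):
        raise ValueError("size in header does not match payload")
    return obj_type, data
```

Because the header is part of the hashed bytes, hashing an empty blob this way gives e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the same ID Git reports for an empty file.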

It was thrilling to see that my handcrafted objects had the exact same format as Git’s – I could even use git cat-file on them if I renamed .vcs to .git (though I didn’t).

Step 2: The Index – A Sneaky Data Structure

The index (.vcs/index) is a binary file that tracks which files are staged and their metadata. I reverse‑engineered the Git index format (version 2) – it’s a 12‑byte header, a series of entries that each pair a fixed‑length block of metadata with a variable‑length path name, and a trailing SHA‑1 checksum.

Parsing it was a bit tedious with struct.unpack, but once I got it right, I could list staged files with ls-files and compare them to the working tree for status.
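As a rough illustration of that struct.unpack work, the sketch below reads and writes a version 2 index held in memory. The IndexEntry tuple and the function names are hypothetical, not taken from PyVCS, and the sketch ignores index extensions and entries with extended flags.

```python
import hashlib
import struct
from typing import NamedTuple

# Ten 32-bit integers, a 20-byte SHA-1, and 16-bit flags: 62 bytes per entry.
ENTRY_FORMAT = "!LLLLLLLLLL20sH"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)


class IndexEntry(NamedTuple):
    ctime_s: int
    ctime_n: int
    mtime_s: int
    mtime_n: int
    dev: int
    ino: int
    mode: int
    uid: int
    gid: int
    size: int
    sha1: bytes
    flags: int
    path: str


def padded_length(raw_path: bytes) -> int:
    # Entries are NUL-padded to a multiple of 8 bytes, with at least one NUL.
    return (ENTRY_SIZE + len(raw_path) + 8) // 8 * 8


def parse_index(data: bytes) -> list[IndexEntry]:
    body, checksum = data[:-20], data[-20:]
    if hashlib.sha1(body).digest() != checksum:
        raise ValueError("index checksum mismatch")
    signature, version, count = struct.unpack("!4sLL", body[:12])
    if signature != b"DIRC" or version != 2:
        raise ValueError("not a version 2 index")
    entries, pos = [], 12
    for _ in range(count):
        fields = struct.unpack(ENTRY_FORMAT, body[pos:pos + ENTRY_SIZE])
        path_end = body.index(b"\x00", pos + ENTRY_SIZE)
        raw_path = body[pos + ENTRY_SIZE:path_end]
        entries.append(IndexEntry(*fields, raw_path.decode()))
        pos += padded_length(raw_path)
    return entries


def serialize_index(entries: list[IndexEntry]) -> bytes:
    chunks = [struct.pack("!4sLL", b"DIRC", 2, len(entries))]
    for e in entries:
        raw_path = e.path.encode()
        head = struct.pack(ENTRY_FORMAT, *e[:12])
        chunks.append((head + raw_path).ljust(padded_length(raw_path), b"\x00"))
    body = b"".join(chunks)
    return body + hashlib.sha1(body).digest()
```

Keeping the reader and writer side by side makes it easy to round-trip an index and catch padding mistakes early.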

The tricky part was building a tree from the index entries. The index is a flat list of paths, but Git stores each directory as its own tree object, with entries sorted by name. I wrote a recursive build_tree_from_entries() that groups entries by directory, builds subtrees, and creates the final tree object. That was a moment of clarity – I finally understood how Git represents directories.
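The recursion is easier to see in code. What follows is a simplified sketch rather than the real build_tree_from_entries(): it takes (path, mode, sha1) tuples instead of full index entries, and store_object() is a hypothetical in-memory stand-in for hash_object().

```python
import hashlib

DIR_MODE = 0o40000


def store_object(data: bytes, obj_type: str, store: dict) -> bytes:
    """Hypothetical stand-in for hash_object(): keeps objects in a dict."""
    full = f"{obj_type} {len(data)}".encode() + b"\x00" + data
    digest = hashlib.sha1(full).digest()
    store[digest.hex()] = full
    return digest  # raw 20 bytes, which is what tree entries embed


def build_tree(entries, store: dict) -> bytes:
    """entries: (path, mode, sha1_bytes) tuples, paths relative to this directory."""
    items = []     # entries that live directly in this directory
    subdirs = {}   # directory name -> entries below it, prefix stripped
    for path, mode, sha in entries:
        head, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(head, []).append((rest, mode, sha))
        else:
            items.append((head, mode, sha))
    for name, children in subdirs.items():
        items.append((name, DIR_MODE, build_tree(children, store)))

    # Git sorts entries bytewise, treating directory names as if they ended in "/".
    def sort_key(item):
        name, mode, _ = item
        return name.encode() + (b"/" if mode == DIR_MODE else b"")

    items.sort(key=sort_key)
    body = b"".join(
        f"{mode:o} {name}".encode() + b"\x00" + sha for name, mode, sha in items
    )
    return store_object(body, "tree", store)
```

Each call writes exactly one tree object, so a repository with three nested directories produces three subtrees plus the root.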

Step 3: Commits and the Master Branch

A commit is just a text object with a tree hash, parent hash (if any), author/committer info, timestamp, and a message. I wrote commit() to:

Write the tree from the current index.
Read the current master pointer (a reference stored in .vcs/refs/heads/master).
Create the commit object.
Update the master ref to the new commit hash.

The log command then simply walks the parent chain, printing each commit’s details. It was like watching a timeline come alive.
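Here is a small sketch of both halves, creating a commit and walking the chain. It keeps objects in a dict instead of on disk, hard-codes a +0000 timezone, and uses hypothetical function names, so treat it as an illustration of the format rather than the PyVCS code.

```python
import hashlib


def make_commit(tree: str, parent, author: str, message: str,
                timestamp: int, store: dict) -> str:
    """Build a commit object and keep its body in an in-memory store."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    # Assumption: author and committer are the same person, timezone is UTC.
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    lines.append("")          # blank line separates headers from the message
    lines.append(message)
    lines.append("")
    data = "\n".join(lines).encode()
    full = f"commit {len(data)}".encode() + b"\x00" + data
    sha1 = hashlib.sha1(full).hexdigest()
    store[sha1] = data
    return sha1


def walk_log(head, store: dict):
    """Yield (sha1, message) from head back to the root commit."""
    sha1 = head
    while sha1:
        headers, _, message = store[sha1].decode().partition("\n\n")
        parent = None
        for line in headers.splitlines():
            if line.startswith("parent "):
                parent = line.split()[1]
        yield sha1, message.strip()
        sha1 = parent
```

Since this tool has no merges, every commit has at most one parent and the walk is a simple linked-list traversal.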

Step 4: The Big Challenge – Push

This was the most difficult part. I wanted to push my local commits to a remote Git repository (like GitHub) over HTTP. Git uses a smart protocol with two phases:

Discovery – GET /info/refs?service=git-receive-pack returns the remote’s current refs.
Upload – POST /git-receive-pack sends a packfile containing the objects that the remote doesn’t have, plus a reference update command.

I had to:

Parse the pkt‑line format (each line is prefixed with four hex digits giving its total length, prefix included; 0000 is a flush packet that ends a section).
Calculate the set of missing objects between local and remote commits (recursive diff of object graphs).
Build a packfile – a custom binary format with a header, compressed object data for each object, and a trailing SHA‑1.
Encode objects with a variable‑length header (type and size) and compressed data.
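The pkt‑line framing from the first item is small enough to sketch in full. The helper names here are made up for the example, and the parser only handles data lines and flush packets, not the other special packets that protocol version 2 adds.

```python
FLUSH = b"0000"


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload: four hex digits of total length, then the payload."""
    return f"{len(payload) + 4:04x}".encode() + payload


def parse_pkt_lines(data: bytes) -> list:
    """Split a byte stream into payloads; a flush packet is returned as None."""
    lines = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            lines.append(None)   # flush packet, no payload
            pos += 4
            continue
        lines.append(data[pos + 4:pos + length])
        pos += length
    return lines
```

For example, pkt_line(b"# service=git-receive-pack\n") starts with 001f: the payload is 27 bytes and the prefix adds 4.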

The packfile logic was the most intricate – I used the official Git documentation and a lot of trial‑and‑error with xxd and hexdump. When I finally saw unpack ok from the server, I literally cheered.
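For reference, a packfile without deltas can be sketched in a few lines. This is an illustration under assumptions, not the PyVCS implementation: the function names are hypothetical and every object is stored whole rather than as a delta.

```python
import hashlib
import struct
import zlib

OBJ_TYPES = {"commit": 1, "tree": 2, "blob": 3}


def encode_pack_object(obj_type: str, data: bytes) -> bytes:
    """Variable-length type/size header followed by zlib-compressed data."""
    size = len(data)
    # First byte: 3 bits of type, then the low 4 bits of the size.
    byte = (OBJ_TYPES[obj_type] << 4) | (size & 0x0F)
    size >>= 4
    header = bytearray()
    while size:
        header.append(byte | 0x80)   # high bit set: more size bytes follow
        byte = size & 0x7F           # later bytes carry 7 bits each
        size >>= 7
    header.append(byte)
    return bytes(header) + zlib.compress(data)


def create_pack(objects) -> bytes:
    """objects: list of (type, uncompressed data) pairs."""
    body = struct.pack("!4sLL", b"PACK", 2, len(objects))
    body += b"".join(encode_pack_object(t, d) for t, d in objects)
    # The trailer is the SHA-1 of everything that precedes it.
    return body + hashlib.sha1(body).digest()
```

Note that the size in the header is the uncompressed length, which is why a 5-byte blob gets the single header byte 0x35 regardless of how well it compresses.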

Step 5: Making It Installable

Once the core was working, I wanted to share it. I structured it as a Python package with a pyproject.toml and a console script entry point. Now anyone can install it with:

pip install pyvcs

and run pyvcs init, pyvcs add, etc. from anywhere.
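For readers who have not set up a console script before, the relevant part of a pyproject.toml might look roughly like this. The build backend, version number, Python requirement, and the pyvcs.cli:main path are all placeholders chosen for illustration, not the project's actual configuration.

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "pyvcs"
version = "0.1.0"            # placeholder version
requires-python = ">=3.9"    # assumed minimum

[project.scripts]
# Maps the `pyvcs` command to a Python function.
# "pyvcs.cli:main" is a hypothetical module path.
pyvcs = "pyvcs.cli:main"
```

The [project.scripts] table is what makes pip generate a pyvcs executable on the PATH.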

What I Learned

Git is not magic – it’s a set of simple, well‑designed data structures and protocols.
The index is a cache – it stores file metadata and hashes to speed up commits and diffs.
Trees are recursive – a subtree’s hash changes only when something inside it changes, so matching hashes let Git skip whole directories when comparing.
Packfiles are clever – they bundle many compressed objects into one stream (full Git also stores similar objects as deltas), but the basic format is surprisingly straightforward.
HTTP is just bytes – the Git protocol is just a stream of lines and binary chunks; once you understand the framing, it’s not scary.

More than that, I gained a deep appreciation for the design decisions that make Git fast and reliable. Linus Torvalds and the Git community built something truly remarkable.

What PyVCS Can (and Can’t) Do

Can:

init, add, commit, status, diff, log
ls-files and cat-file for inspection
push to a remote HTTP repository (with basic auth)

Can’t:

Create branches or tags (only master)
pull or clone (though push works, so half the story is there)
Merge or resolve conflicts

It’s not meant to be a production tool – it’s an educational experiment. But it’s functional, and it helped me understand Git from the inside out.

Try It Yourself

I’ve open‑sourced PyVCS on GitHub under the MIT license. If you’ve ever wondered how version control works, I encourage you to clone it, read the code, and maybe even extend it.

git clone https://github.com/abdullahkhaver/pyvcs.git
cd pyvcs
pip install -e .
pyvcs init mytest
cd mytest
# ... play with it

The joy of building something that works, and the insight gained along the way, make this project one of the most rewarding I’ve ever done.

If you decide to write your own VCS, I’d love to hear about it. Happy hacking!
