Why I Did It
Like many developers, I use Git every day. I know the commands by heart – add, commit, push, log – but I realised I had only a vague idea of what actually happens under the hood. What’s a blob? How does a tree differ from a commit? How does the index (staging area) work? And most mysteriously, how does git push actually send data to a remote server?
I wanted to demystify this. Instead of reading dry documentation, I decided to build my own simplified Git clone. Not to replace Git, but to understand it.
So I wrote PyVCS – a pure-Python version control system that implements a subset of Git’s features. The entire codebase is around 500 lines, and it taught me more than any tutorial could.
What I Set Out to Build
My goals were clear:
- Object storage – store blobs, trees, and commits with SHA‑1 hashing and zlib compression.
- A staging area (index) – the classic add and commit flow.
- Commit history – parent pointers, log, and the ability to walk the chain.
- Working tree operations – status and diff.
- Remote push – speak the Git pack protocol over HTTP.
I also wanted it to be installable – so that after I was done, I could actually use it as a command-line tool, even if it’s limited.
Step 1: Understanding the Object Model
Git’s object model is surprisingly elegant. There are four object types, of which I implemented three: blobs (file contents), trees (directory listings), and commits (snapshots with metadata). I skipped the fourth, annotated tags. Each object is stored as a header (type size\0) followed by the data, compressed with zlib, and named by the SHA‑1 hash of the uncompressed header plus data.
The first thing I wrote was hash_object() – it computes the hash, writes the compressed data to .vcs/objects/ab/cdef... (two‑level directory structure), and returns the hash. I also wrote read_object() to decompress and parse the object back.
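To make the format concrete, here is a minimal sketch of what such a pair of functions might look like. It is not PyVCS's actual code: the signatures and the repo_dir parameter are assumptions made for illustration.

```python
import hashlib
import os
import zlib


def hash_object(data: bytes, obj_type: str, repo_dir: str, write: bool = True) -> str:
    """Hash `data` as a Git-style object and optionally store it under repo_dir."""
    # Header is "<type> <size>" followed by a NUL byte, then the raw data.
    full = f"{obj_type} {len(data)}".encode() + b"\x00" + data
    sha1 = hashlib.sha1(full).hexdigest()
    if write:
        # First two hex digits name the directory, the remaining 38 the file.
        path = os.path.join(repo_dir, "objects", sha1[:2], sha1[2:])
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(zlib.compress(full))
    return sha1


def read_object(sha1: str, repo_dir: str) -> tuple[str, bytes]:
    """Load an object by hash and return (type, data)."""
    path = os.path.join(repo_dir, "objects", sha1[:2], sha1[2:])
    with open(path, "rb") as f:
        full = zlib.decompress(f.read())
    # Split at the first NUL only: the data itself may contain NUL bytes.
    header, _, data = full.partition(b"\x00")
    obj_type, size = header.decode().split()
    if int(size) != len(data):
        raise ValueError("size in header does not match payload length")
    return obj_type, data
```

Because the hash covers the header as well as the contents, a blob and a tree with identical bytes still get different names.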
It was thrilling to see that my handcrafted objects had the exact same format as Git’s – I could even use git cat-file on them if I renamed .vcs to .git (though I didn’t).
Step 2: The Index – A Sneaky Data Structure
The index (.vcs/index) is a binary file that tracks which files are staged and their metadata. I reverse‑engineered the Git index format (version 2) – it’s a 12‑byte header, a series of entries (each a fixed‑length block of metadata followed by a variable‑length, NUL‑padded path name), and a trailing SHA‑1 checksum.
Parsing it was a bit tedious with struct.unpack, but once I got it right, I could list staged files with ls-files and compare them to the working tree for status.
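For a sense of what that parsing involves, here is a rough sketch of a version 2 index reader. The names IndexEntry and parse_index are hypothetical, and the sketch ignores index extensions; it only illustrates the layout described above.

```python
import hashlib
import struct
from typing import NamedTuple


class IndexEntry(NamedTuple):
    ctime_s: int
    ctime_n: int
    mtime_s: int
    mtime_n: int
    dev: int
    ino: int
    mode: int
    uid: int
    gid: int
    size: int
    sha1: bytes
    flags: int
    path: str


# Ten big-endian uint32 fields, a 20-byte SHA-1, and a uint16 flags field.
ENTRY_FORMAT = "!LLLLLLLLLL20sH"
ENTRY_FIXED_SIZE = struct.calcsize(ENTRY_FORMAT)  # 62 bytes


def parse_index(data: bytes) -> list[IndexEntry]:
    """Parse the raw bytes of a version 2 index file into entries."""
    # The last 20 bytes are the SHA-1 of everything before them.
    if hashlib.sha1(data[:-20]).digest() != data[-20:]:
        raise ValueError("index checksum mismatch")
    signature, version, count = struct.unpack("!4sLL", data[:12])
    if signature != b"DIRC" or version != 2:
        raise ValueError("not a version 2 index file")

    entries = []
    offset = 12
    for _ in range(count):
        fields = struct.unpack_from(ENTRY_FORMAT, data, offset)
        path_start = offset + ENTRY_FIXED_SIZE
        path_end = data.index(b"\x00", path_start)
        path_bytes = data[path_start:path_end]
        entries.append(IndexEntry(*fields, path_bytes.decode()))
        # Each entry is padded with 1 to 8 NUL bytes to a multiple of 8.
        offset += ((ENTRY_FIXED_SIZE + len(path_bytes) + 8) // 8) * 8
    return entries
```

Listing staged files is then just a matter of printing entry.path for each parsed entry.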
The tricky part was building a tree from the index entries. The index is a flat list of paths, but Git requires the entries within each tree to be sorted and every nested directory to become its own subtree. I wrote a recursive build_tree_from_entries() that groups entries by directory, builds subtrees, and creates the final tree object. That was a moment of clarity – I finally understood how Git represents directories.
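The recursion can be sketched roughly as follows. This is a simplified stand-in, not the real build_tree_from_entries(): the function name build_tree, the (path, mode, sha1) tuples, and the in-memory dict used as an object store are all assumptions for the sake of a self-contained example.

```python
import hashlib


def hash_object(data: bytes, obj_type: str, store: dict) -> str:
    """Store a Git-style object in an in-memory dict and return its hash."""
    full = f"{obj_type} {len(data)}".encode() + b"\x00" + data
    sha1 = hashlib.sha1(full).hexdigest()
    store[sha1] = full
    return sha1


def build_tree(entries, store: dict) -> str:
    """Build tree objects from (path, mode, sha1_hex) tuples; return the root hash."""
    files = {}
    subdirs = {}
    for path, mode, sha1 in entries:
        head, sep, rest = path.partition("/")
        if sep:
            # Entry lives in a subdirectory: defer it to the recursive call.
            subdirs.setdefault(head, []).append((rest, mode, sha1))
        else:
            files[head] = (mode, sha1)

    items = [(name, mode, sha1, False) for name, (mode, sha1) in files.items()]
    for name, children in subdirs.items():
        items.append((name, 0o40000, build_tree(children, store), True))

    # Git sorts tree entries by name, comparing directories as if the
    # name ended with "/".
    items.sort(key=lambda it: it[0].encode() + (b"/" if it[3] else b""))

    # Each entry is "<octal mode> <name>\0" followed by the raw 20-byte hash.
    body = b"".join(
        f"{mode:o} {name}".encode() + b"\x00" + bytes.fromhex(sha1)
        for name, mode, sha1, _ in items
    )
    return hash_object(body, "tree", store)
```

Since every subtree is hashed before its parent, two directories with identical contents end up sharing a single tree object.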
Step 3: Commits and the Master Branch
A commit is just a text object with a tree hash, parent hash (if any), author/committer info, timestamp, and a message. I wrote commit() to:
- Write the tree from the current index.
- Read the current master pointer (a reference stored in .vcs/refs/heads/master).
- Create the commit object.
- Update the master ref to the new commit hash.
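The last three steps might look something like the sketch below. It assumes the tree hash has already been computed, and the names format_commit and write_commit, their parameters, and the fixed +0000 timezone are illustrative choices rather than PyVCS's real interface.

```python
import hashlib
import os
import zlib


def format_commit(tree, parent, author, message, timestamp, utc_offset="+0000"):
    """Return the raw text of a commit object (without the object header)."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    stamp = f"{author} {timestamp} {utc_offset}"
    lines.append(f"author {stamp}")
    lines.append(f"committer {stamp}")
    lines.append("")  # blank line separates the headers from the message
    lines.append(message)
    lines.append("")  # ensures the object ends with a newline
    return "\n".join(lines).encode()


def write_commit(repo_dir, tree, author, message, timestamp):
    """Create a commit on master and move the ref to it; return the new hash."""
    ref_path = os.path.join(repo_dir, "refs", "heads", "master")
    parent = None
    if os.path.exists(ref_path):
        with open(ref_path) as f:
            parent = f.read().strip()

    data = format_commit(tree, parent, author, message, timestamp)
    full = f"commit {len(data)}".encode() + b"\x00" + data
    sha1 = hashlib.sha1(full).hexdigest()

    obj_path = os.path.join(repo_dir, "objects", sha1[:2], sha1[2:])
    os.makedirs(os.path.dirname(obj_path), exist_ok=True)
    with open(obj_path, "wb") as f:
        f.write(zlib.compress(full))

    # Updating the ref is what actually "moves the branch".
    os.makedirs(os.path.dirname(ref_path), exist_ok=True)
    with open(ref_path, "w") as f:
        f.write(sha1 + "\n")
    return sha1
```

The first commit in a repository simply has no parent line.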
The log command then simply walks the parent chain, printing each commit’s details. It was like watching a timeline come alive.
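As an illustration of that walk, here is a small sketch that follows first parents only. The read_commit callback (anything that maps a hash to the raw commit text) and both function names are assumptions, so the example runs without a repository on disk.

```python
def parse_commit(data: bytes) -> dict:
    """Split raw commit text into header fields, parents, and message."""
    header, _, message = data.decode().partition("\n\n")
    fields = {"parents": [], "message": message}
    for line in header.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parents"].append(value)
        else:
            fields[key] = value
    return fields


def walk_history(head, read_commit):
    """Yield (sha1, commit) pairs from `head` back to the root commit."""
    sha1 = head
    while sha1:
        commit = parse_commit(read_commit(sha1))
        yield sha1, commit
        # With no merges there is at most one parent to follow.
        sha1 = commit["parents"][0] if commit["parents"] else None


# Usage with a toy in-memory store (the short hashes are placeholders):
store = {
    "c2": b"tree t2\nparent c1\nauthor a 2 +0000\ncommitter a 2 +0000\n\nsecond\n",
    "c1": b"tree t1\nauthor a 1 +0000\ncommitter a 1 +0000\n\nfirst\n",
}
for sha1, commit in walk_history("c2", store.__getitem__):
    print(sha1, commit["message"].strip())
```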
Step 4: The Big Challenge – Push
This was the most difficult part. I wanted to push my local commits to a remote Git repository (like GitHub) over HTTP. Git uses a smart protocol with two phases:
- Discovery – GET /info/refs?service=git-receive-pack returns the remote’s current refs.
- Upload – POST /git-receive-pack sends a packfile containing the objects that the remote doesn’t have, plus a reference update command.
I had to:
- Parse the pkt‑line format (each line prefixed with a 4‑byte hex length).
- Calculate the set of missing objects between local and remote commits (recursive diff of object graphs).
- Build a packfile – a custom binary format with a header, compressed object data for each object, and a trailing SHA‑1.
- Encode objects with a variable‑length header (type and size) and compressed data.
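The pkt‑line framing from the first item is small enough to sketch in full. The helper names below are hypothetical, and build_ref_update shows only one plausible way to assemble the body of the upload request, assuming a single ref and the report-status capability.

```python
FLUSH_PKT = b"0000"  # a zero-length packet marks the end of a section
ZERO_ID = "0" * 40   # stands in for "no previous commit" when creating a ref


def encode_pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as 4 hex digits."""
    return f"{len(payload) + 4:04x}".encode() + payload


def decode_pkt_lines(data: bytes) -> list:
    """Split a pkt-line stream into payloads; flush packets become None."""
    lines = []
    offset = 0
    while offset < len(data):
        length = int(data[offset:offset + 4], 16)
        if length == 0:
            lines.append(None)
            offset += 4
        elif length < 4:
            raise ValueError(f"unsupported special packet {length:04x}")
        else:
            lines.append(data[offset + 4:offset + length])
            offset += length
    return lines


def build_ref_update(old_sha: str, new_sha: str, ref: str, pack: bytes) -> bytes:
    """Assemble a request body: one update command, a flush, then the pack."""
    # Capabilities follow a NUL byte on the first command line.
    command = f"{old_sha} {new_sha} {ref}".encode() + b"\x00 report-status"
    return encode_pkt_line(command) + FLUSH_PKT + pack
```

For example, encode_pkt_line(b"hello\n") yields b"000ahello\n": six payload bytes plus the four length digits make ten, or 000a in hex.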
The packfile logic was the most intricate – I used the official Git documentation and a lot of trial‑and‑error with xxd and hexdump. When I finally saw unpack ok from the server, I literally cheered.
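To show how little is needed for the simplest case, here is a sketch of a pack writer that stores every object whole, with no delta compression. The function names are made up for this example, and it assumes each object is supplied as a (type name, raw data) pair.

```python
import hashlib
import struct
import zlib

# Numeric type codes used inside packfiles.
OBJ_TYPES = {"commit": 1, "tree": 2, "blob": 3}


def encode_pack_object_header(type_num: int, size: int) -> bytes:
    """Encode the type and uncompressed size as a variable-length header."""
    # First byte: continuation bit, 3 bits of type, low 4 bits of size.
    byte = (type_num << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # high bit set means another byte follows
        byte = size & 0x7F       # later bytes carry 7 bits of size each
        size >>= 7
    out.append(byte)
    return bytes(out)


def build_pack(objects) -> bytes:
    """Build a version 2 packfile from (type_name, raw_data) pairs."""
    objects = list(objects)
    body = struct.pack("!4sLL", b"PACK", 2, len(objects))
    for type_name, data in objects:
        body += encode_pack_object_header(OBJ_TYPES[type_name], len(data))
        # Only the raw data is compressed; the "type size\0" header used
        # for loose objects is not part of the pack entry.
        body += zlib.compress(data)
    # The trailer is the SHA-1 of everything that precedes it.
    return body + hashlib.sha1(body).digest()
```

A 16‑byte blob, for instance, gets the two‑byte header b"\xb0\x01": the low four size bits are zero, and the remaining bit spills into the second byte.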
Step 5: Making It Installable
Once the core was working, I wanted to share it. I structured it as a Python package with a pyproject.toml and a console script entry point. Now anyone can install it with:
pip install pyvcs
and run pyvcs init, pyvcs add, etc. from anywhere.
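A console script entry point is just a function that the installer wraps in an executable. The sketch below shows one way such a main() could be laid out with argparse; the subcommand arguments are guesses for illustration, and the handlers only print what they would do rather than calling real PyVCS code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define a few subcommands in the style of `pyvcs init`, `pyvcs add`."""
    parser = argparse.ArgumentParser(prog="pyvcs", description="A tiny Git-like VCS")
    sub = parser.add_subparsers(dest="command", required=True)

    init = sub.add_parser("init", help="create a new repository")
    init.add_argument("repo", help="directory to create the repository in")

    add = sub.add_parser("add", help="stage files")
    add.add_argument("paths", nargs="+", help="files to add to the index")

    commit = sub.add_parser("commit", help="record the staged snapshot")
    commit.add_argument("-m", "--message", required=True, help="commit message")
    return parser


def main(argv=None) -> int:
    """Entry point: parse arguments and dispatch (placeholder dispatch only)."""
    args = build_parser().parse_args(argv)
    # A real tool would call the matching function here.
    print(f"would run {args.command!r} with {vars(args)}")
    return 0
```

Assuming this lived in a module named pyvcs/cli.py (a hypothetical layout), the pyproject.toml would point the pyvcs command at pyvcs.cli:main under [project.scripts].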
What I Learned
- Git is not magic – it’s a set of simple, well‑designed data structures and protocols.
- The index is a cache – it stores file metadata and hashes to speed up commits and diffs.
- Trees are recursive – they’re the key to Git’s fast directory comparisons.
- Packfiles are clever – they compress and deduplicate objects, but the format is surprisingly straightforward.
- HTTP is just bytes – the Git protocol is just a stream of lines and binary chunks; once you understand the framing, it’s not scary.
More than that, I gained a deep appreciation for the design decisions that make Git fast and reliable. Linus Torvalds and the Git community built something truly remarkable.
What PyVCS Can (and Can’t) Do
Can:
- init, add, commit, status, diff, log
- ls-files and cat-file for inspection
- push to a remote HTTP repository (with basic auth)
Can’t:
- Branches or tags (only master)
- pull or clone (though push works, so half the story is there)
- Merging or conflict resolution
It’s not meant to be a production tool – it’s an educational experiment. But it’s functional, and it helped me understand Git from the inside out.
Try It Yourself
I’ve open‑sourced PyVCS on GitHub under the MIT license. If you’ve ever wondered how version control works, I encourage you to clone it, read the code, and maybe even extend it.
git clone https://github.com/abdullahkhaver/pyvcs.git
cd pyvcs
pip install -e .
pyvcs init mytest
cd mytest
# ... play with it
The joy of building something that works, and the insight gained, makes this project one of the most rewarding I’ve ever done.
If you decide to write your own VCS, I’d love to hear about it. Happy hacking!













