my CLI search took 4.3 seconds for 5000 files. i swapped the python cosine loop for a numpy matmul. it got slower.
short-lived python CLIs punish optimizing the wrong layer.
this is what happened shipping vemb 0.3.0: what i tried first that regressed performance, and what actually worked.
## the setup
vemb is a CLI that wraps Gemini Embedding 2 for text, images, audio, video, and PDFs. `vemb search ./files "query"` embeds every file, caches the vectors, and returns the top matches by cosine similarity.
the cache in 0.2.0 was JSON:
```json
{
  "version": 1,
  "model": "gemini-embedding-2-preview",
  "dim": 3072,
  "entries": {
    "file.png:size:mtime": { "values": [0.012, -0.034, ...] }
  }
}
```
the search path looked like this:
```python
cache = json.loads(Path('.vemb/cache.json').read_text())
for f in files:
    values = cache[cache_key(f)]['values']  # python list of 3072 floats
    score = cosine_similarity(query_emb, values)
    results.append((score, f))
```
`cosine_similarity` was pure python. a for-loop over a list of 3072 floats. at 5000 files, that's about 15 million float multiplies and adds in python.
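for context, the 0.2.0 loop was essentially this (a sketch from memory; the real helper may differ in details):

```python
import math

def cosine_similarity(a, b):
    # pure python: one multiply-add per dimension, ~3072 iterations per file
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    return dot / math.sqrt(norm_a * norm_b)
```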
the obvious optimization: replace the python loop with numpy.
## the first rewrite
i added a batched cosine that takes a query vector and a matrix of document vectors:
```python
def cosine_similarity_batch(query, matrix):
    q = np.asarray(query, dtype=np.float32)
    m = np.asarray(matrix, dtype=np.float32)
    q_norm = np.linalg.norm(q)
    m_norms = np.linalg.norm(m, axis=1)
    return (m @ q) / (m_norms * q_norm)
```
then i benchmarked it in isolation, numpy already imported:
| N | python loop | numpy batch | speedup |
|---|---|---|---|
| 100 | 19.55ms | 0.13ms | 148x |
| 1000 | 196ms | 0.70ms | 278x |
| 10000 | 1975ms | 12.32ms | 160x |
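the harness behind those numbers was the usual timeit pattern: kernel only, numpy pre-imported, data pre-built. something like this (`bench` is a reconstruction, not the exact script):

```python
import timeit
import numpy as np

def bench(n, d=3072, repeats=5):
    rng = np.random.default_rng(0)
    matrix = rng.standard_normal((n, d)).astype(np.float32)  # pre-built: never timed
    query = rng.standard_normal(d).astype(np.float32)

    def kernel():
        norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
        return (matrix @ query) / norms

    # best-of timing of the kernel alone; import and conversion costs are invisible
    return min(timeit.repeat(kernel, number=1, repeat=repeats))
```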
this was the win i expected. i pushed the branch, bumped the version, ran the test suite, and was about to tag v0.3.0.
then i ran the end-to-end benchmark.
## the real test
methodology: time `vemb search /tmp/test "query" --dim 3072` in a fresh shell subprocess, not a python REPL. warm cache (all vectors already embedded), best of three runs on Apple Silicon.
| N | main (v0.2.0) | numpy branch | result |
|---|---|---|---|
| 200 | 1.32s | 1.59s | 20% slower |
| 1000 | 2.34s | 2.27s | tied |
| 5000 | 4.31s | 6.96s | 60% slower |
numpy was winning in the synthetic test by 100-300x. end-to-end it lost at every real scale i cared about.
and the loss got worse as N grew. at N=5000, user CPU on the numpy branch was 5.2 seconds vs 2.9 seconds on main. burning 80% more CPU to go slower.
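a python version of that end-to-end harness looks roughly like this (a sketch; the original was shell `time` in a loop, and `time_cli` is illustrative):

```python
import resource
import subprocess
import sys
import time

def time_cli(cmd, runs=3):
    """best-of-N wall time plus total child user CPU for a CLI run in a fresh process."""
    best_wall = float("inf")
    start_user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        best_wall = min(best_wall, time.perf_counter() - start)
    user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime - start_user
    return best_wall, user

# a trivial cold-process baseline: interpreter startup alone
wall, user = time_cli([sys.executable, "-c", "pass"])
```

wall time and user CPU come from the same run, which is what lets the two numbers disagree visibly.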
## where did the 4.3s go?
rough breakdown of the main v0.2.0 command at N=5000, warm cache:
- `json.loads` of the 317 MB cache: ~2.5s
- gemini API round-trip for the query embedding: ~0.6s
- pure-python cosine loop over 5000 × 3072 floats: ~1.0s
- file scan, click boot, sort, print: ~0.2s
the loop was ~25% of the command. the thing i was about to optimize was not the thing eating my wall time.
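a breakdown like this doesn't need a profiler; bracketing each phase with a timer is enough (a sketch, `stage` is illustrative, not vemb code):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name, totals):
    # bracket one phase of the command and accumulate its wall time
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - start
```

wrap each phase as `with stage("json.loads", totals): ...` and print `totals` at the end.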
## two hidden costs
cost one: numpy import.
`python -c 'import numpy'` on macOS takes ~180ms cold. that's a lot of native startup work before you touch the actual math.
i added `import numpy as np` at the top of `embed.py`. `cli.py` imports `embed.py`. every invocation of vemb — even `vemb --version`, even `vemb text "hello"` — now paid 180ms just to start.
on `vemb --version`, user CPU went from 0.77s on main to 2.42s on the branch. for a command that doesn't touch numpy at all.
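you can measure the cold-import tax directly; the important part is a fresh interpreter per measurement, so nothing is already cached in `sys.modules` (a sketch):

```python
import subprocess
import sys
import time

def cold_import_seconds(module):
    # spawn a fresh interpreter so the import is genuinely cold
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start
```

`python -X importtime -c 'import numpy'` breaks the same number down per submodule.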
cost two: asarray conversion.
the cache was JSON. loading it gave me a dict of python lists. to use `m @ q`, numpy needed a contiguous float32 matrix.
```python
m = np.asarray(matrix, dtype=np.float32)
```
at N=5000, D=3072 that's 15 million python floats that numpy has to pull out of a list of lists and copy into a dense array. on my machine that single `asarray` call took ~2 seconds.
so to save ~1 second of python cosine, i was paying 180ms of numpy import plus 2 seconds of list-to-matrix conversion.
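the conversion cost is easy to reproduce: build the same matrix from a list of lists versus from an array that's already float32 (a sketch; sizes scaled down so it runs quickly):

```python
import time
import numpy as np

n, d = 1000, 3072  # scale n toward 5000 for the full-size case
as_lists = [[0.0] * d for _ in range(n)]       # what json.loads hands you
as_array = np.zeros((n, d), dtype=np.float32)  # what a binary cache hands you

start = time.perf_counter()
np.asarray(as_lists, dtype=np.float32)         # millions of PyFloat -> float32 copies
from_lists = time.perf_counter() - start

start = time.perf_counter()
np.asarray(as_array, dtype=np.float32)         # dtype already matches: no copy at all
from_array = time.perf_counter() - start
```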
the synthetic benchmark never measured any of this. it measured the kernel in isolation, after numpy was already loaded and the matrix was already built. that benchmark was lying to me.
## the real fix
the compute wasn't the bottleneck. the cache format was.
parsing 317 MB of JSON takes 2-3 seconds by itself. `json.loads` is slow, and it produces python objects that numpy then has to re-unpack into a contiguous array.
so i replaced the cache with a binary numpy matrix.
```
.vemb/
  vectors.npy     # float32 (N, D) matrix, pre-normalized
  manifest.json   # {key: row_index, ...} + metadata
```
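the write side is just `np.save` on a pre-normalized float32 matrix plus a json manifest. a minimal sketch (`save_cache` and the manifest fields are illustrative, not vemb's exact code):

```python
import json
from pathlib import Path
import numpy as np

def save_cache(directory, keys, vectors):
    """persist an (N, D) float32 matrix, L2-normalized on write, plus a key->row manifest."""
    cache = Path(directory) / ".vemb"
    cache.mkdir(parents=True, exist_ok=True)
    matrix = np.array(vectors, dtype=np.float32)             # always copy before mutating
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # normalize once, at write time
    np.save(cache / "vectors.npy", matrix)
    manifest = {"version": 2, "dim": int(matrix.shape[1]),
                "rows": {k: i for i, k in enumerate(keys)}}
    (cache / "manifest.json").write_text(json.dumps(manifest))
```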
the new search path:
```python
keys, matrix = load_cache(directory, dim)  # np.load, milliseconds
query = np.asarray(query_emb, dtype=np.float32)
query /= np.linalg.norm(query)
scores = matrix @ query  # single BLAS call
```
three things changed together:
- vectors load from `.npy` in tens of milliseconds instead of seconds. a `.npy` file is a tiny header plus a raw memory dump of the array. `np.load` can mmap it if you pass `mmap_mode='r'`.
- they land as a contiguous float32 matrix already, so there's no `asarray` conversion.
- they're pre-normalized on write, so cosine reduces to a plain dot product at query time.
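the load side, in sketch form (signature simplified; vemb's `load_cache` takes extra arguments):

```python
import json
from pathlib import Path
import numpy as np

def load_cache(directory):
    cache = Path(directory) / ".vemb"
    manifest = json.loads((cache / "manifest.json").read_text())
    # mmap_mode='r' maps pages on demand instead of reading the whole file up front
    matrix = np.load(cache / "vectors.npy", mmap_mode="r")
    keys = list(manifest["rows"])  # json preserves insertion order == row order
    return keys, matrix
```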
i also moved `import numpy as np` out of module scope and into the function that needs it. `vemb --version` doesn't pay the import tax anymore. only `vemb search` does.
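the deferred-import pattern looks like this (`cosine_scores` is a hypothetical name for illustration):

```python
def cosine_scores(rows, query):
    # deferred import: only commands that actually score vectors pay for numpy
    import numpy as np
    m = np.asarray(rows, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    return (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
```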
the scores themselves didn't change. this was a storage and execution change, not a model change. top-k results matched the old code in my tests.
## what shipped
| N | v0.2.0 | v0.3.0 | speedup |
|---|---|---|---|
| 200 | 1.62s | 1.48s | tied |
| 1000 | 2.34s | 1.51s | 1.5x |
| 5000 | 4.31s | 1.64s | 2.6x |
cache size at N=5000, 3072-dim:
- old: 317 MB `cache.json`
- new: 61 MB `vectors.npy` + 229 KB `manifest.json` (5x smaller)
existing `.vemb/cache.json` caches auto-migrate to the binary format on first load. no user action.
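migration can be as simple as: if the old JSON exists and the binary matrix doesn't, pay the slow parse one last time, write the new format, delete the old file. a sketch under those assumptions (vemb's actual logic may differ):

```python
import json
from pathlib import Path
import numpy as np

def migrate(cache_dir):
    cache = Path(cache_dir)
    old, new = cache / "cache.json", cache / "vectors.npy"
    if not old.exists() or new.exists():
        return  # nothing to migrate, or already migrated
    data = json.loads(old.read_text())  # the slow parse, one last time
    keys = list(data["entries"])
    matrix = np.array([data["entries"][k]["values"] for k in keys], dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # pre-normalize on write
    np.save(new, matrix)
    (cache / "manifest.json").write_text(json.dumps(
        {"version": 2, "dim": int(matrix.shape[1]),
         "rows": {k: i for i, k in enumerate(keys)}}))
    old.unlink()
```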
## the takeaway
for short-lived python CLIs, the bottleneck is often IO, deserialization, or imports. hot-loop speedups can miss the real cost. the synthetic benchmark said the loop was slow. the real benchmark said the loop ran for 1 second inside a 4-second command.
if you only measure the kernel, you optimize the kernel. if you measure end-to-end in a fresh subprocess, you find the real cost.
the methodology that exposed this:
- same input corpus
- cold process, not a REPL
- real disk cache
- best of three runs
- measure wall time AND user CPU (they tell different stories)
the numpy branch is still on GitHub. i never merged it. the fix was in a layer i wasn't even looking at.
## the principle
measure the whole CLI in a fresh process before touching the hot loop.
```
pip install -U vemb
```