my CLI search took 4.3 seconds for 5000 files. i swapped the python cosine loop for a numpy matmul. it got slower.
short-lived python CLIs punish optimizing the wrong layer.
this is what happened shipping vemb 0.3.0: what i tried first that regressed performance, and what actually worked.
## the setup
vemb is a CLI that wraps Gemini Embedding 2 for text, images, audio, video, and PDFs. `vemb search ./files "query"` embeds every file, caches the vectors, and returns the top matches by cosine similarity.
the cache in 0.2.0 was JSON:
```json
{
  "version": 1,
  "model": "gemini-embedding-2-preview",
  "dim": 3072,
  "entries": {
    "file.png:size:mtime": { "values": [0.012, -0.034, ...] }
  }
}
```
the search path looked like this:
```python
cache = json.loads(Path('.vemb/cache.json').read_text())
for f in files:
    values = cache[cache_key(f)]['values']  # python list of 3072 floats
    score = cosine_similarity(query_emb, values)
    results.append((score, f))
```
`cosine_similarity` was pure python. a for-loop over a list of 3072 floats. at 5000 files, that's about 15 million float multiplies and adds in python.
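for context, the 0.2.0 loop was essentially this (a sketch from memory; the real helper may differ in details):

```python
import math

def cosine_similarity(a, b):
    # pure python: one multiply-add per dimension, ~3072 iterations per file
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    return dot / math.sqrt(norm_a * norm_b)
```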
the obvious optimization: replace the python loop with numpy.
## the first rewrite
i added a batched cosine that takes a query vector and a matrix of document vectors:
```python
def cosine_similarity_batch(query, matrix):
    q = np.asarray(query, dtype=np.float32)
    m = np.asarray(matrix, dtype=np.float32)
    q_norm = np.linalg.norm(q)
    m_norms = np.linalg.norm(m, axis=1)
    return (m @ q) / (m_norms * q_norm)
```
then i benchmarked it in isolation, numpy already imported:
| N | python loop | numpy batch | speedup |
|---|---|---|---|
| 100 | 19.55ms | 0.13ms | 148x |
| 1000 | 196ms | 0.70ms | 278x |
| 10000 | 1975ms | 12.32ms | 160x |
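the harness behind those numbers was the usual timeit pattern: kernel only, numpy pre-imported, data pre-built. something like this (`bench` is a reconstruction, not the exact script):

```python
import timeit
import numpy as np

def bench(n, d=3072, repeats=5):
    rng = np.random.default_rng(0)
    matrix = rng.standard_normal((n, d)).astype(np.float32)  # pre-built: never timed
    query = rng.standard_normal(d).astype(np.float32)

    def kernel():
        norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
        return (matrix @ query) / norms

    # best-of timing of the kernel alone; import and conversion costs are invisible
    return min(timeit.repeat(kernel, number=1, repeat=repeats))
```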
this was the win i expected. i pushed the branch, bumped the version, ran the test suite, and was about to tag v0.3.0.
then i ran the end-to-end benchmark.
## the real test
methodology: time `vemb search /tmp/test "query" --dim 3072` in a fresh shell subprocess, not a python REPL. warm cache (all vectors already embedded), best of three runs on Apple Silicon.
| N | main (v0.2.0) | numpy branch | result |
|---|---|---|---|
| 200 | 1.32s | 1.59s | 20% slower |
| 1000 | 2.34s | 2.27s | tied |
| 5000 | 4.31s | 6.96s | 60% slower |
numpy was winning in the synthetic test by 100-300x. end-to-end it lost at every real scale i cared about.
and the loss got worse as N grew. at N=5000, user CPU on the numpy branch was 5.2 seconds vs 2.9 seconds on main. burning 80% more CPU to go slower.
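a python version of that end-to-end harness looks roughly like this (a sketch; the original was shell `time` in a loop, and `time_cli` is illustrative):

```python
import resource
import subprocess
import sys
import time

def time_cli(cmd, runs=3):
    """best-of-N wall time plus total child user CPU for a CLI run in a fresh process."""
    best_wall = float("inf")
    start_user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        best_wall = min(best_wall, time.perf_counter() - start)
    user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime - start_user
    return best_wall, user

# a trivial cold-process baseline: interpreter startup alone
wall, user = time_cli([sys.executable, "-c", "pass"])
```

wall time and user CPU come from the same run, which is what lets the two numbers disagree visibly.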
## where did the 4.3s go?
rough breakdown of the main v0.2.0 command at N=5000, warm cache:
- `json.loads` of the 317 MB cache: ~2.5s
- gemini API round-trip for the query embedding: ~0.6s
- pure-python cosine loop over 5000 × 3072 floats: ~1.0s
- file scan, click boot, sort, print: ~0.2s
the loop was ~25% of the command. the thing i was about to optimize was not the thing eating my wall time.
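a breakdown like this doesn't need a profiler; bracketing each phase with a timer is enough (a sketch, `stage` is illustrative, not vemb code):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name, totals):
    # bracket one phase of the command and accumulate its wall time
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - start
```

wrap each phase as `with stage("json.loads", totals): ...` and print `totals` at the end.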
## two hidden costs
cost one: numpy import.
`python -c 'import numpy'` on macOS takes ~180ms cold. that's a lot of native startup work before you touch the actual math.
i added `import numpy as np` at the top of `embed.py`. `cli.py` imports `embed.py`. every invocation of vemb — even `vemb --version`, even `vemb text "hello"` — now paid 180ms just to start.
on `vemb --version`, user CPU went from 0.77s on main to 2.42s on the branch. for a command that doesn't touch numpy at all.
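you can measure the cold-import tax directly; the important part is a fresh interpreter per measurement, so nothing is already cached in `sys.modules` (a sketch):

```python
import subprocess
import sys
import time

def cold_import_seconds(module):
    # spawn a fresh interpreter so the import is genuinely cold
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start
```

`python -X importtime -c 'import numpy'` breaks the same number down per submodule.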
cost two: asarray conversion.
the cache was JSON. loading it gave me a dict of python lists. to use `m @ q`, numpy needed a contiguous float32 matrix.
```python
m = np.asarray(matrix, dtype=np.float32)
```
at N=5000, D=3072 that's 15 million python floats that numpy has to pull out of a list of lists and copy into a dense array. on my machine that single `asarray` call took ~2 seconds.
so to save ~1 second of python cosine, i was paying 180ms of numpy import plus 2 seconds of list-to-matrix conversion.
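the conversion cost is easy to reproduce: build the same matrix from a list of lists versus from an array that's already float32 (a sketch; sizes scaled down so it runs quickly):

```python
import time
import numpy as np

n, d = 1000, 3072  # scale n toward 5000 for the full-size case
as_lists = [[0.0] * d for _ in range(n)]       # what json.loads hands you
as_array = np.zeros((n, d), dtype=np.float32)  # what a binary cache hands you

start = time.perf_counter()
np.asarray(as_lists, dtype=np.float32)         # millions of PyFloat -> float32 copies
from_lists = time.perf_counter() - start

start = time.perf_counter()
np.asarray(as_array, dtype=np.float32)         # dtype already matches: no copy at all
from_array = time.perf_counter() - start
```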
the synthetic benchmark never measured any of this. it measured the kernel in isolation, after numpy was already loaded and the matrix was already built. that benchmark was lying to me.
## the real fix
the compute wasn't the bottleneck. the cache format was.
parsing 317 MB of JSON takes 2-3 seconds by itself. `json.loads` is slow, and it produces python objects that numpy then has to re-unpack into a contiguous array.
so i replaced the cache with a binary numpy matrix.
```
.vemb/
  vectors.npy     # float32 (N, D) matrix, pre-normalized
  manifest.json   # {key: row_index, ...} + metadata
```
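the write side is just `np.save` on a pre-normalized float32 matrix plus a json manifest. a minimal sketch (`save_cache` and the manifest fields are illustrative, not vemb's exact code):

```python
import json
from pathlib import Path
import numpy as np

def save_cache(directory, keys, vectors):
    """persist an (N, D) float32 matrix, L2-normalized on write, plus a key->row manifest."""
    cache = Path(directory) / ".vemb"
    cache.mkdir(parents=True, exist_ok=True)
    matrix = np.array(vectors, dtype=np.float32)             # always copy before mutating
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # normalize once, at write time
    np.save(cache / "vectors.npy", matrix)
    manifest = {"version": 2, "dim": int(matrix.shape[1]),
                "rows": {k: i for i, k in enumerate(keys)}}
    (cache / "manifest.json").write_text(json.dumps(manifest))
```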
the new search path:
```python
keys, matrix = load_cache(directory, dim)  # np.load, milliseconds
query = np.asarray(query_emb, dtype=np.float32)
query /= np.linalg.norm(query)
scores = matrix @ query  # single BLAS call
```
three things changed together:
- vectors load from `.npy` in tens of milliseconds instead of seconds. a `.npy` file is a tiny header plus a raw memory dump of the array. `np.load` can mmap it if you pass `mmap_mode='r'`.
- they land as a contiguous float32 matrix already, so there's no `asarray` conversion.
- they're pre-normalized on write, so cosine reduces to a plain dot product at query time.
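the load side, in sketch form (signature simplified; vemb's `load_cache` takes extra arguments):

```python
import json
from pathlib import Path
import numpy as np

def load_cache(directory):
    cache = Path(directory) / ".vemb"
    manifest = json.loads((cache / "manifest.json").read_text())
    # mmap_mode='r' maps pages on demand instead of reading the whole file up front
    matrix = np.load(cache / "vectors.npy", mmap_mode="r")
    keys = list(manifest["rows"])  # json preserves insertion order == row order
    return keys, matrix
```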
i also moved `import numpy as np` out of module scope and into the function that needs it. `vemb --version` doesn't pay the import tax anymore. only `vemb search` does.
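the deferred-import pattern looks like this (`cosine_scores` is a hypothetical name for illustration):

```python
def cosine_scores(rows, query):
    # deferred import: only commands that actually score vectors pay for numpy
    import numpy as np
    m = np.asarray(rows, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    return (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
```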
the scores themselves didn't change. this was a storage and execution change, not a model change. top-k results matched the old code in my tests.
## what shipped
| N | v0.2.0 | v0.3.0 | speedup |
|---|---|---|---|
| 200 | 1.62s | 1.48s | tied |
| 1000 | 2.34s | 1.51s | 1.5x |
| 5000 | 4.31s | 1.64s | 2.6x |
cache size at N=5000, 3072-dim:
- old: 317 MB `cache.json`
- new: 61 MB `vectors.npy` + 229 KB `manifest.json` (5x smaller)
existing `.vemb/cache.json` caches auto-migrate to the binary format on first load. no user action.
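migration can be as simple as: if the old JSON exists and the binary matrix doesn't, pay the slow parse one last time, write the new format, delete the old file. a sketch under those assumptions (vemb's actual logic may differ):

```python
import json
from pathlib import Path
import numpy as np

def migrate(cache_dir):
    cache = Path(cache_dir)
    old, new = cache / "cache.json", cache / "vectors.npy"
    if not old.exists() or new.exists():
        return  # nothing to migrate, or already migrated
    data = json.loads(old.read_text())  # the slow parse, one last time
    keys = list(data["entries"])
    matrix = np.array([data["entries"][k]["values"] for k in keys], dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # pre-normalize on write
    np.save(new, matrix)
    (cache / "manifest.json").write_text(json.dumps(
        {"version": 2, "dim": int(matrix.shape[1]),
         "rows": {k: i for i, k in enumerate(keys)}}))
    old.unlink()
```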
## the takeaway
for short-lived python CLIs, the bottleneck is often IO, deserialization, or imports. hot-loop speedups can miss the real cost. the synthetic benchmark said the loop was slow. the real benchmark said the loop ran for 1 second inside a 4-second command.
if you only measure the kernel, you optimize the kernel. if you measure end-to-end in a fresh subprocess, you find the real cost.
the methodology that exposed this:
- same input corpus
- cold process, not a REPL
- real disk cache
- best of three runs
- measure wall time AND user CPU (they tell different stories)
the numpy branch is still on GitHub. i never merged it. the fix was in a layer i wasn't even looking at.
## the principle
measure the whole CLI in a fresh process before touching the hot loop.
```
pip install -U vemb
```