Enter fullscreen mode

I run LiteLLM as my AI gateway. 100+ providers, one OpenAI-compatible API. It works, it scales, I like it. But after a year of pushing traffic through the Python proxy, one thing kept bugging me: memory.

Under concurrent load, the Python proxy peaks around 359MB. Multiply that across pods, regions, retries. OOM kills at the worst possible time. You know the feeling.

LiteLLM just announced they're migrating the entire hot path to Rust. Not a rewrite. Not a v2. Same config.yaml, same database, same API. The runtime underneath just gets faster.

I went through their benchmark numbers. They look real.

The numbers

	Rust gateway	LiteLLM Python
Per-request overhead	~0.05ms	~7.5ms
Throughput (50 concurrent)	6,782 req/s	453 req/s
Peak memory under load	31.7MB	358.9MB

15x throughput. 11x less memory. 150x lower per-request overhead. The harness is checked into the repo so you can reproduce it yourself.

For most workloads, gateway overhead is noise compared to model latency. A Claude call takes 500ms to 30s. Adding 7ms vs 0.05ms, who cares. But for high-throughput stuff like classification batches, embeddings at scale, or coding agents hammering completions, it adds up fast.

How they're doing it

The migration is a clean four-stage plan:

Stage 0: Pure Python (today)
Stage 1: Rust core via PyO3, Python still does I/O
Stage 2: FastAPI thin shell, entire hot path in Rust
Stage 3: Pure Rust server (axum), Python plugins in sidecar

What I like about this approach: they're not flipping everything at once. Each route moves individually. OCR first (smallest surface, no streaming). Then /v1/messages (adds streaming). Then /chat/completions (largest param surface). One provider at a time, parity check gates every step.

The Rust core is pure transforms. It turns your request into a provider request, turns the response back, handles stream chunks, counts tokens. No sockets, no secrets, no database access. Python keeps doing I/O until Stage 3. Clean separation.

Timeline

Aug 15  - litellm.ocr() → Rust
Sep 1   - /messages, /chat/completions → Rust
Sep 15  - Router (load balancing, fallbacks, retries) → Rust
Dec 1   - Full server: axum replaces FastAPI

What stays the same

Everything you care about:

Same config.yaml
Same database and schema
Same client API, same request/response shapes
Same providers, routing, keys
Custom Python plugins keep working in the sidecar

You deploy the Rust binary, it uses ~65MB of memory, overhead stays under 1ms. Nothing in your setup changes.

Why this matters

The "Python is slow" argument against LiteLLM was always a stretch. Gateway overhead is maybe 0.3% of total latency on a typical LLM call. Most of the time you're waiting on the model, not the proxy.

But now even that argument is gone. Sub-1ms overhead, 32MB memory, 6,782 req/s on a single instance. Good luck finding a lighter gateway that still covers 100+ providers.

Full architecture diagrams and the reproducible benchmark setup are in the announcement: docs.litellm.ai/blog/litellm-rust-launch

Curious if anyone else is running their AI gateway through Rust. What's your setup look like?