I run LiteLLM as my AI gateway. 100+ providers, one OpenAI-compatible API. It works, it scales, I like it. But after a year of pushing traffic through the Python proxy, one thing kept bugging me: memory.
Under concurrent load, the Python proxy peaks around 359MB. Multiply that across pods, regions, retries. OOM kills at the worst possible time. You know the feeling.
LiteLLM just announced they're migrating the entire hot path to Rust. Not a rewrite. Not a v2. Same config.yaml, same database, same API. The runtime underneath just gets faster.
I went through their benchmark numbers. They look real.
The numbers
| Rust gateway | LiteLLM Python | |
|---|---|---|
| Per-request overhead | ~0.05ms | ~7.5ms |
| Throughput (50 concurrent) | 6,782 req/s | 453 req/s |
| Peak memory under load | 31.7MB | 358.9MB |
15x throughput. 11x less memory. 150x lower per-request overhead. The harness is checked into the repo so you can reproduce it yourself.
For most workloads, gateway overhead is noise compared to model latency. A Claude call takes 500ms to 30s. Adding 7ms vs 0.05ms, who cares. But for high-throughput stuff like classification batches, embeddings at scale, or coding agents hammering completions, it adds up fast.
How they're doing it
The migration is a clean four-stage plan:
Stage 0: Pure Python (today)
Stage 1: Rust core via PyO3, Python still does I/O
Stage 2: FastAPI thin shell, entire hot path in Rust
Stage 3: Pure Rust server (axum), Python plugins in sidecar
What I like about this approach: they're not flipping everything at once. Each route moves individually. OCR first (smallest surface, no streaming). Then /v1/messages (adds streaming). Then /chat/completions (largest param surface). One provider at a time, parity check gates every step.
The Rust core is pure transforms. It turns your request into a provider request, turns the response back, handles stream chunks, counts tokens. No sockets, no secrets, no database access. Python keeps doing I/O until Stage 3. Clean separation.
Timeline
Aug 15 - litellm.ocr() → Rust
Sep 1 - /messages, /chat/completions → Rust
Sep 15 - Router (load balancing, fallbacks, retries) → Rust
Dec 1 - Full server: axum replaces FastAPI
What stays the same
Everything you care about:
- Same
config.yaml - Same database and schema
- Same client API, same request/response shapes
- Same providers, routing, keys
- Custom Python plugins keep working in the sidecar
You deploy the Rust binary, it uses ~65MB of memory, overhead stays under 1ms. Nothing in your setup changes.
Why this matters
The "Python is slow" argument against LiteLLM was always a stretch. Gateway overhead is maybe 0.3% of total latency on a typical LLM call. Most of the time you're waiting on the model, not the proxy.
But now even that argument is gone. Sub-1ms overhead, 32MB memory, 6,782 req/s on a single instance. Good luck finding a lighter gateway that still covers 100+ providers.
Full architecture diagrams and the reproducible benchmark setup are in the announcement: docs.litellm.ai/blog/litellm-rust-launch
Curious if anyone else is running their AI gateway through Rust. What's your setup look like?













