This article was originally published on runaihome.com
TL;DR: A "CUDA out of memory" error almost always means one of three things: your context window is too long, your KV cache or batch is reserving VRAM up front, or fragmentation is wasting memory you technically have. The fastest wins: shrink context, quantize the KV cache, and set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. You rarely need a bigger GPU; you need a tighter config.
What you'll be able to do after this guide:
- Read the error line and know which allocation blew up: model weights, KV cache, or activations
- Apply the right fix per engine (Ollama, llama.cpp, ComfyUI, vLLM) instead of guessing
- Free 30–60% of your VRAM without downgrading the model you actually want to run
Honest take: The number-one cause is a context window you never asked for. Tools like Ollama and vLLM will happily pre-reserve KV cache for an 8K, 32K, or 128K window even when your prompt is 400 tokens. Cap the context to what you actually use and most OOMs disappear before you touch anything else.
First, read the error: it tells you which allocation failed
The canonical message looks like this:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.99 GiB of which 1.43 GiB is free.
Two numbers matter: total capacity and how much was free when it failed. If it died trying to allocate a large block while several GB were still "reserved but unallocated," that's fragmentation, not a true shortage, and it calls for a different fix. If it died with almost nothing free, you genuinely overcommitted, and you need to cut something real (context, batch, model size, or precision).
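The triage above can be sketched as a small parser. This is a minimal illustration, not part of PyTorch: the function names are hypothetical, the "reserved by PyTorch but unallocated" phrase is assumed from the longer form of the message that recent PyTorch versions print, and the rule "reserved is at least as large as the request" is a rough heuristic.

```python
import re

SIZE = r"([\d.]+) (KiB|MiB|GiB)"
TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}


def find_gib(message, before="", after=""):
    """Find a size such as '2.00 GiB' next to literal phrases; return GiB or None."""
    match = re.search(re.escape(before) + SIZE + re.escape(after), message)
    if match is None:
        return None
    return float(match.group(1)) * TO_GIB[match.group(2)]


def diagnose(message):
    """Classify an OOM message as 'fragmentation', 'shortage', or 'unrecognized'."""
    message = " ".join(message.split())  # collapse line breaks and repeated spaces
    requested = find_gib(message, before="Tried to allocate ")
    total = find_gib(message, before="total capacity of ")
    free = find_gib(message, before="of which ", after=" is free")
    # Assumed phrase; it is only present in the longer form of the message.
    reserved = find_gib(message, after=" is reserved by PyTorch but unallocated")
    if requested is None or free is None:
        return {"verdict": "unrecognized"}
    # Heuristic: enough memory is cached but idle, so the request failed on layout.
    fragmented = reserved is not None and reserved >= requested
    return {
        "verdict": "fragmentation" if fragmented else "shortage",
        "requested_gib": requested,
        "free_gib": free,
        "total_gib": total,
    }
```

A "fragmentation" verdict points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a "shortage" verdict means something real has to shrink.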
Before changing anything, watch the GPU while the job runs:
$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
memory.used [MiB], memory.total [MiB]
22310 MiB, 24576 MiB
If memory.used climbs steadily until the crash, your context or batch is the leak. If it spikes at one node (a VAE decode, a long-context prefill), that single step is the culprit and the fix is local to it.
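To tell the two shapes apart without eyeballing the log, a rough classifier over the captured nvidia-smi lines might look like this. The function names and the 50% threshold are illustrative assumptions, not a standard.

```python
def parse_used_mib(lines):
    """Pull the memory.used column out of nvidia-smi CSV lines, skipping the header."""
    used = []
    for line in lines:
        first = line.split(",")[0].strip()
        if first and first[0].isdigit():
            used.append(int(first.split()[0]))
    return used


def classify_trace(samples_mib, spike_share=0.5):
    """Label a memory.used trace as 'spike', 'steady climb', or 'flat'.

    A trace counts as a spike when one sample-to-sample jump accounts for at
    least `spike_share` of the total rise above the first sample.
    """
    if len(samples_mib) < 2:
        return "flat"
    rise = max(samples_mib) - samples_mib[0]
    if rise <= 0:
        return "flat"
    biggest_jump = max(b - a for a, b in zip(samples_mib, samples_mib[1:]))
    return "spike" if biggest_jump / rise >= spike_share else "steady climb"
```

Feed it the output of the nvidia-smi command above, redirected to a file and read back line by line.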
Where VRAM actually goes
Three buckets compete for the same card:
- Model weights: fixed once loaded. An 8B model at Q4 is ~4.5 GB; at FP16 it's ~16 GB.
- KV cache: grows with context length and concurrent requests. This is the silent killer.
- Activations / working buffers: transient, but a 4K-resolution VAE decode in ComfyUI can momentarily need several GB.
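The weight figures in the first bucket follow from simple arithmetic: parameter count times bits per weight. A minimal sketch, assuming about 4.5 effective bits per weight for a 4-bit quant (actual values vary by quant type) and ignoring small extras such as metadata:

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


fp16_gb = weight_gb(8, 16)   # 8B parameters at 16 bits each
q4_gb = weight_gb(8, 4.5)    # assumed ~4.5 bits per weight for a Q4 quant
```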
The KV cache is where most people lose. Cutting an 8B model's context from 8192 to 2048 tokens saves roughly 1.5 GB; on a 70B model the same cut frees 6 GB or more, because the cache scales with layer count and hidden size. Treat both figures as ballpark: models that use grouped-query attention store fewer KV heads and save proportionally less. Either way, that's free VRAM with zero quality loss as long as your prompts genuinely fit the smaller window. If you don't understand quant levels yet, the quantization explainer and the Q4 vs Q5 vs Q6 vs Q8 quality breakdown are worth a detour, since they decide bucket #1's size.
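To see where savings like these come from, the cache size can be estimated from the model's shape. The sketch below uses the standard per-token formula (one K and one V tensor per layer) with an assumed 8B-class shape of 32 layers and a head dimension of 128; the two KV-head counts are assumptions that bracket grouped-query and full multi-head attention, so check your model's config for the real values.

```python
GIB = 1024**3


def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for `tokens` positions; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


tokens_cut = 8192 - 2048

# Assumed 8B-class shape with grouped-query attention (8 KV heads), f16 cache.
saved_gqa_gib = kv_cache_bytes(tokens_cut, n_layers=32, n_kv_heads=8, head_dim=128) / GIB

# Same shape with full multi-head attention (32 KV heads), f16 cache.
saved_mha_gib = kv_cache_bytes(tokens_cut, n_layers=32, n_kv_heads=32, head_dim=128) / GIB
```

Under these assumptions the cut saves 0.75 GiB with 8 KV heads and 3 GiB with 32, so the figure quoted above sits inside that range and your own model may land at either end.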
Fix it in Ollama
Ollama's OOMs are almost always a context or KV-cache problem, because it defaults to a generous context for many models.
1. Cap the context. Set it per run or bake it into a Modelfile:
# one-off, inside an interactive session
$ ollama run llama3.1:8b
>>> /set parameter num_ctx 2048
# permanent, via Modelfile
PARAMETER num_ctx 2048
2. Enable Flash Attention. It reduces KV-cache VRAM by 30–50% with no quality degradation, and it unlocks cache quantization:
$ export OLLAMA_FLASH_ATTENTION=1
3. Quantize the KV cache. With Flash Attention on, set the cache type. q8_0 halves the cache for a negligible quality hit; q4_0 cuts it to roughly a third, with some loss on very long contexts:
$ export OLLAMA_KV_CACHE_TYPE=q8_0
Flash Attention plus a q8_0 cache together let you push context lengths roughly 2× higher before you run out of memory.
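The "halves" and "roughly a third" figures line up with the storage cost of each cache type. A quick check, assuming the ggml block formats (32 values per block plus a 16-bit scale, giving 8.5 bits per element for q8_0 and 4.5 for q4_0):

```python
# Assumed storage cost per cached element, in bits, for each cache type.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def relative_cache_size(cache_type):
    """Cache size as a fraction of the f16 baseline."""
    return BITS_PER_ELEMENT[cache_type] / BITS_PER_ELEMENT["f16"]


def context_multiplier(cache_type):
    """How many times more tokens fit in the same cache budget as f16."""
    return 1 / relative_cache_size(cache_type)
```

With these numbers q8_0 comes to about 53% of the f16 cache (about 1.9× the context in the same budget) and q4_0 to about 28%.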
One trap worth knowing: KV-cache quantization only applies to architectures on Ollama's allowlist. Force q8_0 on an unsupported architecture and the server silently falls back to f16, so you set the flag, see no savings, and still OOM. If quantizing the cache changes nothing, that's why; check your model's architecture support before assuming the flag is broken.
If Ollama still spills to CPU or refuses the GPU entirely, that's a different symptom. See Ollama not using your GPU, which covers the driver and passthrough side.
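If you call Ollama from code rather than the CLI, the context cap can also travel with each request through the options field of the HTTP API. A minimal sketch using only the standard library; the helper names are hypothetical and the host is assumed to be Ollama's default local address:

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx, host="http://localhost:11434"):
    """Build a POST to /api/generate that caps the context window for this call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # same setting as PARAMETER num_ctx
    }
    return urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx=2048):
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling generate("llama3.1:8b", "Why is the sky blue?") requires a running server; building the request does not.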
Fix it in llama.cpp
llama.cpp gives you the most direct controls. The three levers, in order of impact:
# -c          context size, the biggest lever
# -ngl        GPU layers; lower this to keep some layers on CPU
# -fa         flash attention (some newer builds expect a value, e.g. -fa on)
# -ctk/-ctv   quantize the K and V cache
$ ./llama-server -m model.gguf \
    -c 2048 \
    -ngl 28 \
    -fa \
    -ctk q8_0 -ctv q8_0
-ngl (number of GPU layers) is your safety valve: a model that won't fully fit can run partially on the GPU and partially on the CPU. You lose speed for every layer that lands on the CPU, but it runs. Drop -ngl by 4–8 at a time until it loads, then check nvidia-smi for the headroom you have left. If you're routinely offloading half the model, your VRAM tier is the real constraint. The how much VRAM for Llama models guide maps model size to the card you actually need, and system RAM matters once you're offloading.
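Rather than stepping -ngl down blindly, you can estimate a starting value. The sketch below is a rough heuristic, not something llama.cpp computes: it assumes all layers are the same size, ignores the embedding and output tensors, and uses a placeholder reserve for the KV cache and working buffers.

```python
import math


def max_gpu_layers(free_vram_gb, model_size_gb, n_layers, reserve_gb=1.5):
    """Estimate how many layers fit on the GPU; use the result as a first -ngl guess.

    `reserve_gb` is an assumed allowance for KV cache and scratch buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    budget_gb = free_vram_gb - reserve_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))
```

For example, a 13 GB model with 40 layers on a card with 12 GB free and a 2 GB reserve gives a first guess of 30 layers; confirm with nvidia-smi and adjust from there.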
Fix it in ComfyUI (and Stable Diffusion / Flux)
Image and video models OOM differently: weights are smaller, but a single VAE decode or a high-resolution latent can spike VRAM hard at one node.
Launch flags are the first move. Add them to your run script:
$ python main.py --lowvram
# or, on 12GB or less:
$ python main.py --lowvram --force-fp16
--lowvram moves model components to system RAM when they're idle, trading generation speed for lower peak VRAM. The milder --medvram flag, which cuts peak VRAM by roughly 30–40% at the cost of 10–20% slower generation, belongs to AUTOMATIC1111-style web UIs; ComfyUI's launcher does not define it. For Flux specifically, set the Load Diffusion Model node's weight_dtype to fp8_e4m3fn, which roughly halves model VRAM.
Move the VAE off the GPU. The decode step is a common spike. Running it on the CPU costs a few seconds but saves several hundred MB to a couple of GB at the exact moment you tend to crash:
$ python main.py --lowvram --cpu-vae
Free memory between runs. ComfyUI can hold the previous model in VRAM. Drop a "Free Model and Clip Memory" node after generation, or install a memory-management pack, so back-to-back workflows don't stack. If you're chasing speed and VRAM together on RTX 40/50 cards, the ComfyUI NVFP4 guide covers the format that does both.
Fix it in vLLM
vLLM is the trickiest because it pre-reserves KV-cache blocks up front for throughput. With max-model-len=32768 and max-num-seqs=256, even a 7B model's KV cache can balloon past 20 GB before a single real request arrives.
# --max-model-len            lock to your real prompt+gen length
# --gpu-memory-utilization   raise carefully; lower on shared hosts
# --kv-cache-dtype fp8       Ampere or newer
# --enforce-eager            skips CUDA-graph pre-allocation
$ vllm serve mymodel \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --enforce-eager
The single most effective change is --max-model-len: set it to the longest prompt plus generation you actually serve, not the model's theoretical maximum. vLLM reserves blocks for the full window, so any gap between your real prompt length and max-model-len is wasted VRAM. After that, --enforce-eager reclaims the memory CUDA graphs pre-allocate, --kv-cache-dtype fp8 halves cache size on Ampere or newer, and --cpu-offload-gb 4 buys headroom if you're still short.
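A back-of-the-envelope model shows why --max-model-len matters so much. This is a simplified sketch of how the KV pool is sized, not vLLM's actual accounting: it assumes the pool is whatever remains of the utilization budget after weights and a fixed overhead, whereas vLLM profiles activation memory at startup. All names and default values here are illustrative.

```python
GIB = 1024**3


def kv_pool_tokens(total_vram_gib, gpu_memory_utilization, weights_gib,
                   kv_bytes_per_token, overhead_gib=1.0):
    """Tokens of KV cache that fit in the budget left after weights and overhead."""
    budget_gib = total_vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    return max(0, int(budget_gib * GIB // kv_bytes_per_token))


def full_length_sequences(pool_tokens, max_model_len):
    """How many sequences of the maximum length the pool can hold at once."""
    return pool_tokens // max_model_len
```

With an assumed 24 GiB card, 14 GiB of weights, and 128 KiB of cache per token, the pool at 0.90 utilization holds only one 32768-token sequence but about a dozen 4096-token ones. Halving kv_bytes_per_token, as an fp8 cache does, doubles the pool.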
On gpu-memory-utilization: the default 0.90 is aggressive for a 24 GB workstation card or any shared host, so start at 0.85 there. On






