Running AI Models on a Budget: My Experience with Ollama and Free LLMs
I run AI models on a 4GB RAM cloud VM with no GPU. Here's how I made it work with Ollama and free API fallbacks.
The Challenge
- Hardware: 2 CPU cores, 4GB RAM, no GPU
- Budget: $0 for inference
- Goal: Run AI models 24/7 for content generation, code analysis, and automation
My Solution: Ollama + Free API Chain
import requests
class LLMRouter:
def __init__(self):
self.ollama_url = "http://localhost:11434/api/generate"
self.local_models = ["qwen2.5:0.5b", "gemma:2b", "qwen2.5:1.5b"]
self.cloud_models = [
"nvidia/nemotron-nano-9b-v2:free",
"qwen/qwen3-coder:free",
"google/gemma-4-26b-a4b-it:free",
]
def call(self, prompt, max_tokens=500):
# Try local first (free, fast for simple tasks)
for model in self.local_models:
try:
resp = requests.post(self.ollama_url, json={
"model": model,
"prompt": prompt,
"stream": False,
"options": {"num_predict": max_tokens, "num_ctx": 2048}
}, timeout=30)
if resp.status_code == 200:
return resp.json()["response"], model
except:
continue
# Fallback to free cloud APIs
return self._call_cloud(prompt, max_tokens)
Model Selection Guide
Based on my testing on a 4GB machine:
| Model | Size | RAM | Speed | Quality | Best For |
|---|---|---|---|---|---|
| qwen2.5:0.5b | 0.5B | ~1GB | Fast | Basic | Quick tasks |
| qwen2.5:1.5b | 1.5B | ~2GB | Moderate | Good | General use |
| gemma:2b | 2B | ~2GB | Moderate | Good | Creative writing |
| phi3:mini | 3.8B | ~3GB | Slow | Great | Complex reasoning |
Memory Optimization
Running on 4GB RAM requires careful tuning:
# Environment variables
export OLLAMA_NUM_THREADS=1 # Leave 1 core for OS
export OLLAMA_CONTEXT_LENGTH=2048 # Reduce from default 4096
export OLLAMA_KEEP_ALIVE=24h # Keep model in RAM
export OLLAMA_MAX_LOADED_MODELS=1 # Only one model at a time
Why These Settings Matter
- NUM_THREADS = 1: You have 2 cores. If Ollama uses both, the OS and other processes starve and crash.
- CONTEXT_LENGTH = 2048: Default 4096 doubles RAM usage. 2048 is enough for 90% of tasks.
- KEEP_ALIVE = 24h: Cold model loading takes 10-30 seconds on CPU. Keeping it warm eliminates this delay.
- MAX_LOADED_MODELS = 1: Each loaded model consumes RAM. One at a time prevents OOM.
Building a Fallback Chain
The key to reliability is having multiple fallback options:
class ResilientLLM:
def __init__(self):
self.chain = [
("ollama-qwen0.5b", self._ollama, "qwen2.5:0.5b"),
("ollama-qwen1.5b", self._ollama, "qwen2.5:1.5b"),
("ollama-gemma2b", self._ollama, "gemma:2b"),
("openrouter-free", self._openrouter, "nvidia/nemotron-nano-9b-v2:free"),
]
self._current = 0
def call(self, prompt, **kwargs):
for i in range(len(self.chain)):
idx = (self._current + i) % len(self.chain)
name, func, model = self.chain[idx]
try:
result = func(prompt, model=model, **kwargs)
if result:
self._current = idx
return result, name
except:
continue
return None, None
Real-World Performance
On my 4GB cloud VM:
- Simple text generation (50 words): 5-10 seconds local, 2-3 seconds cloud
- Code generation (100 lines): 30-60 seconds local, 10-15 seconds cloud
- Complex analysis: 2-5 minutes local, 15-30 seconds cloud
- Daily inference cost: $0 (100% local + free tier APIs)
Tips for Budget AI
- Use the smallest model that works — qwen2.5:0.5b handles 60% of tasks
- Cache aggressively — don't re-generate identical content
- Batch requests — process multiple items in one prompt
- Monitor memory — kill unused models before OOM
- Free tier APIs exist — OpenRouter offers free models as fallback
Conclusion
You don't need expensive GPUs to run AI. With Ollama, smart model selection, and free API fallbacks, you can build a fully autonomous AI system on a $0/month budget. The key is optimization, not hardware.













