The Problem That Started Everything
I was building a RAG system and realized something terrifying: I had no idea if it was actually working.
The LLM would confidently cite information that wasn't in the retrieved documents. We had passing tests. The API was fast. But something was fundamentally broken, and we couldn't see it.
I asked my team: "Is our retrieval actually good? Which embedding model performs better? Does reranking help? Why did this fail?"
Nobody could answer. We had no visibility.
That moment sparked an idea: RAGEval—a platform for measuring, debugging, and optimizing RAG systems. A place where teams could:
- Upload documents
- Create evaluation datasets
- Run RAG experiments
- Compare configurations side-by-side
- Measure retrieval quality, answer faithfulness, cost, latency
- See exactly why things fail
But before I could build any of that, I needed a foundation. I needed a production-ready API that could:
- Call LLMs reliably
- Handle errors gracefully
- Stream responses in real-time
- Support multiple LLM providers (not locked into one)
- Be fully tested
This is the story of building that foundation in 2 days. And how it became the heartbeat of RAGEval.
Day 1: The Foundation Takes Shape
I started with a blank canvas. No code, no git history. Just the vision of RAGEval and the need to prove it could actually work.
Why a Solid Foundation Matters
Most projects fail not because the idea is bad, but because the foundation is rickety. I wasn't going to make that mistake. Before writing a single feature for RAGEval, I needed something rock-solid underneath.
# Create the project
uv init rageval
cd rageval
# Add the core dependencies we'll need
uv add fastapi uvicorn python-dotenv pydantic litellm groq
uv add --dev pytest pytest-asyncio httpx
I chose uv because it's fast; dependency management shouldn't be a bottleneck. I chose FastAPI because it's built for async work, which is critical when you're calling external APIs. I chose LiteLLM upfront, even though I'd only use Groq initially, because RAGEval needs to support any LLM provider. This is intentional architecture, not accidental.
The Health Check (You Always Need This)
# src/main.py
from fastapi import FastAPI

app = FastAPI()


@app.get("/")
async def root():
    return {"Message": "Hello from the root"}
Simple. Boring. Essential. Every production API needs a health check. Load balancers need it. Kubernetes needs it. Monitoring needs it.
Push to GitHub. Commit message: "Initial setup".
The Real Work: Building the LLM Completion Endpoint
Now I faced the actual challenge. RAGEval would eventually need to:
- Retrieve documents from a vector database
- Pass them to an LLM with a question
- Generate an answer
- Stream it back to the user
- Handle everything that could go wrong
For now, I'd build just the LLM part. The streaming, async foundation that everything else would rest on.
# src/main.py (continued)
from fastapi import FastAPI, HTTPException, status
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from litellm import acompletion
from litellm.exceptions import APIError, APIConnectionError
from dotenv import load_dotenv
import logging

load_dotenv()  # read provider API keys from .env
logger = logging.getLogger("uvicorn.error")

app = FastAPI()  # as before; the root endpoint from the earlier snippet is omitted here


class CompletionRequest(BaseModel):
    prompt: str
    model: str = "groq/llama-3.3-70b-versatile"
    max_tokens: int = 500


@app.post("/complete")
async def request_llm(request: CompletionRequest):
    try:
        # Errors raised here happen before any bytes are sent,
        # so they can still become proper HTTP error responses.
        response = await acompletion(
            model=request.model,
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens,
            stream=True
        )

        async def stream_generator():
            try:
                async for chunk in response:
                    content = chunk.choices[0].delta.content
                    if content:  # skip empty or None deltas
                        yield content
            except Exception as stream_err:
                # The response has already started, so report the failure in the body.
                logger.error(f"Stream interrupted: {str(stream_err)}")
                yield "\n[Error: Stream Interrupted]"

        return StreamingResponse(stream_generator(), media_type="text/plain")
    except APIError as api_err:
        logger.error(f"LLM API Error: {api_err.message}")
        raise HTTPException(
            status_code=api_err.status_code,
            detail=f"LLM API Error: {api_err.message}"
        )
    except APIConnectionError as conn_err:
        logger.error(f"LLM Connection Error: {str(conn_err)}")
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Could not reach LLM."
        )
    except Exception as e:
        logger.error(f"Unexpected error: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="An unexpected internal server error occurred."
        )
I made deliberate choices here:
- Async/await everywhere: non-blocking. This matters when RAGEval eventually handles 100 concurrent evaluation requests.
- Streaming built in: RAGEval will need to stream evaluation results back to users.
- Groq as default: fast inference on an open-weight Llama model, which keeps costs down while testing.
- Logging from day 1: when RAGEval fails in production, I need to know why.
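To make the async point concrete, here is a minimal sketch using only the standard library. `fake_llm_call` is a hypothetical stand-in for a network-bound LLM request (it just sleeps) and is not part of the RAGEval code; the idea is only that 100 awaited calls overlap instead of queuing behind each other.

```python
import asyncio
import time


async def fake_llm_call(request_id: int, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a network-bound LLM request."""
    # asyncio.sleep hands control back to the event loop, like awaiting real I/O.
    await asyncio.sleep(delay)
    return f"answer-{request_id}"


async def run_concurrently(n: int) -> tuple[list[str], float]:
    """Start n fake calls at once; return their results and the elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(n)))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    answers, elapsed = asyncio.run(run_concurrently(100))
    # Run one after another, 100 calls of 0.05 s each would need about 5 s.
    print(f"{len(answers)} calls finished in {elapsed:.2f}s")
```

The same reasoning applies to the endpoint: while one request waits on the provider, the event loop is free to serve the others.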
Tested it:
curl -X POST http://localhost:8000/complete \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain what RAG is"}'
The response streamed back token by token. It worked.
End of Day 1. I had a working foundation. Committed. Pushed. Ready to build on it.
Day 2: Making It Production-Ready
Day 1 proved the concept. Day 2 was about making it bulletproof.
The Reality Check: Things Break
I walked through the code with fresh eyes:
- What if Groq's API times out?
- What if someone sends a malformed request?
- What if the network dies mid-stream?
- What if something completely unexpected happens?
Day 1 me had no answers. Day 2 me had error handling.
The code already had the right structure:
except APIError as api_err:
    logger.error(f"LLM API Error: {api_err.message}")
    raise HTTPException(
        status_code=api_err.status_code,
        detail=f"LLM API Error: {api_err.message}"
    )
except APIConnectionError as conn_err:
    logger.error(f"LLM Connection Error: {str(conn_err)}")
    raise HTTPException(
        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        detail="Could not reach LLM."
    )
except Exception as e:
    logger.error(f"Unexpected error: {str(e)}")
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail="An unexpected internal server error occurred."
    )
Three distinct error types:
- API errors: The LLM provider returned an error. We log it with context and tell the user.
- Connection errors: We can't reach the LLM provider, typically because of a network issue or because the service is down. We respond with 503 (Service Unavailable). An authentication failure is a different case; LiteLLM raises a separate AuthenticationError for that.
- Everything else: Catch-all. We log it and tell the user something went wrong.
This isn't overcomplicated. It's realistic. Production systems fail. The question is whether you fail gracefully.
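One detail worth guarding against: the API-error handler passes `api_err.status_code` straight through to `HTTPException`, which assumes the provider always supplies a valid HTTP error status. The sketch below shows one way to make that mapping explicit. The exception classes are hypothetical stand-ins so the example runs without LiteLLM installed, `status_for` is an illustrative helper rather than part of the RAGEval code, and falling back to 502 is my own assumption about a sensible default.

```python
class FakeAPIError(Exception):
    """Hypothetical stand-in for a provider error that carries a status code."""

    def __init__(self, message: str, status_code=None):
        super().__init__(message)
        self.status_code = status_code


class FakeConnectionError(Exception):
    """Hypothetical stand-in for a failure to reach the provider."""


def status_for(exc: Exception) -> int:
    """Map an exception to the HTTP status the API should return."""
    if isinstance(exc, FakeConnectionError):
        return 503  # provider unreachable
    if isinstance(exc, FakeAPIError):
        code = exc.status_code
        # Pass through only genuine error statuses; otherwise report a bad gateway.
        if isinstance(code, int) and 400 <= code <= 599:
            return code
        return 502
    return 500  # anything unexpected


print(status_for(FakeAPIError("rate limited", 429)))   # 429
print(status_for(FakeAPIError("no status", None)))     # 502
print(status_for(FakeConnectionError("dns failure")))  # 503
print(status_for(ValueError("bug")))                   # 500
```

Keeping the mapping in one small function also makes it easy to unit test without starting the app.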
Tests: The Confidence Layer
I could have skipped testing. That would have been a mistake.
# tests/test_api.py
import pytest
from httpx import ASGITransport, AsyncClient

from src.main import app


@pytest.mark.asyncio
async def test_root():
    transport = ASGITransport(app=app)
    async with AsyncClient(
        transport=transport,
        base_url="http://test"
    ) as client:
        response = await client.get("/")
    assert response.status_code == 200
    assert response.json() == {"Message": "Hello from the root"}
Basic. But essential. We're testing that the health endpoint works using httpx.ASGITransport, which calls the app in-process. Fast. No network calls.
Now the real test—completion:
# tests/test_completion.py
from unittest.mock import patch

import pytest
from httpx import ASGITransport, AsyncClient

from src.main import app


@pytest.mark.asyncio
async def test_completion():
    async def fake_response():
        # Mimic the shape the endpoint reads: chunk.choices[0].delta.content
        yield type("Chunk", (), {
            "choices": [
                type("Choice", (), {
                    "delta": type("Delta", (), {
                        "content": "Hello from fake LLM"
                    })
                })
            ]
        })

    # Awaiting the patched acompletion returns our fake async generator
    with patch("src.main.acompletion", return_value=fake_response()):
        transport = ASGITransport(app=app)
        async with AsyncClient(
            transport=transport,
            base_url="http://test"
        ) as client:
            response = await client.post(
                "/complete",
                json={"prompt": "hello"}
            )
    assert response.status_code == 200
    assert response.text == "Hello from fake LLM"
Key insight: We mock the LLM call. We don't call Groq. We test that our code correctly handles whatever the LLM returns. This is fast. This is cheap. This is reliable.
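The nested `type(...)` calls work, but they are hard to read. A possible alternative, sketched here with only the standard library, builds the same attribute shape with `types.SimpleNamespace`. `make_chunk`, `fake_stream`, and `collect` are hypothetical helper names, not part of the repository.

```python
import asyncio
from types import SimpleNamespace


def make_chunk(content):
    """Build an object exposing chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def fake_stream(*pieces):
    """Async generator that yields one fake chunk per piece of text."""
    for piece in pieces:
        yield make_chunk(piece)


async def collect(stream) -> str:
    """Join chunk contents, skipping empty ones the way the endpoint does."""
    parts = []
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


print(asyncio.run(collect(fake_stream("Hello", None, " from fake LLM"))))
# prints: Hello from fake LLM
```

In a test like the one above, `return_value=fake_stream("Hello from fake LLM")` could then take the place of `fake_response()`.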
And streaming:
# tests/test_streaming.py
from unittest.mock import patch

import pytest
from httpx import ASGITransport, AsyncClient

from src.main import app


@pytest.mark.asyncio
async def test_streaming():
    async def fake_response():
        # Two chunks, so the endpoint has to join them in order
        for content in ("Explain", " AI"):
            yield type("Chunk", (), {
                "choices": [
                    type("Choice", (), {
                        "delta": type("Delta", (), {
                            "content": content
                        })
                    })
                ]
            })

    with patch("src.main.acompletion", return_value=fake_response()):
        transport = ASGITransport(app=app)
        async with AsyncClient(
            transport=transport,
            base_url="http://test"
        ) as client:
            response = await client.post(
                "/complete",
                json={"prompt": "Explain AI"}
            )
    assert response.status_code == 200
    assert response.text == "Explain AI"
This verifies that chunks are joined in order: two chunks ("Explain" and " AI") come back as one response body, "Explain AI". Note that response.text reads the whole body at once, so the test checks the content of the stream, not the timing of individual tokens.
Run everything:
uv run pytest -v
# test_api.py::test_root PASSED
# test_completion.py::test_completion PASSED
# test_streaming.py::test_streaming PASSED
All green. Now I could refactor with confidence. The tests have my back.
Multi-Provider Support: The Hidden Architecture
Look at the CompletionRequest:
class CompletionRequest(BaseModel):
    prompt: str
    model: str = "groq/llama-3.3-70b-versatile"
    max_tokens: int = 500
The model parameter is configurable. Why does this matter?
RAGEval's entire value proposition is evaluation and comparison. Teams will want to test:
- Does Claude perform better than Groq?
- Is OpenAI's embedding model worth the cost?
- Should we use a smaller local model?
I built this in upfront by choosing LiteLLM. Now RAGEval can support any LLM provider without changing the code:
# Groq (default, fast and cheap)
curl -X POST http://localhost:8000/complete \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain RAG", "model": "groq/llama-3.3-70b-versatile"}'
# OpenAI (expensive but capable)
curl -X POST http://localhost:8000/complete \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain RAG", "model": "gpt-4o-mini"}'
# Anthropic (different approach to reasoning)
curl -X POST http://localhost:8000/complete \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain RAG", "model": "claude-3-5-sonnet-20241022"}'
# Local model (no API costs)
curl -X POST http://localhost:8000/complete \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain RAG", "model": "ollama/mistral"}'
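All work, provided the matching API key is set in the environment (or, for Ollama, a local server is running). One codebase. That's the power of thinking ahead.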
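Switching providers does depend on credentials being present. LiteLLM reads provider keys from environment variables such as `GROQ_API_KEY`, `OPENAI_API_KEY`, and `ANTHROPIC_API_KEY`. A pre-flight check might look like the sketch below. The prefix-based lookup is a simplification I am assuming for illustration, not how LiteLLM itself routes models, and `missing_key` is a hypothetical helper.

```python
import os

# Assumed mapping from model-name prefix to the env var holding its API key.
# Ollama runs locally, so it needs no key.
KEY_FOR_PREFIX = {
    "groq/": "GROQ_API_KEY",
    "gpt-": "OPENAI_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
    "ollama/": None,
}


def missing_key(model: str, env=None):
    """Return the name of the unset env var for `model`, or None if nothing is missing."""
    env = os.environ if env is None else env
    for prefix, var in KEY_FOR_PREFIX.items():
        if model.startswith(prefix):
            if var is not None and not env.get(var):
                return var
            return None
    return None  # unknown model: leave the decision to the provider library


print(missing_key("groq/llama-3.3-70b-versatile", env={}))  # GROQ_API_KEY
print(missing_key("ollama/mistral", env={}))                 # None
```

A check like this could run at startup or before an experiment, so a missing key shows up as a clear message rather than a failed request halfway through a comparison.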
Streaming Already Works
The implementation handles streaming correctly:
async def stream_generator():
    try:
        async for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                yield content
    except Exception as stream_err:
        logger.error(f"Stream interrupted: {str(stream_err)}")
        yield "\n[Error: Stream Interrupted]"

return StreamingResponse(stream_generator(), media_type="text/plain")
Users get tokens in real-time. If the stream breaks, they see [Error: Stream Interrupted] instead of silence. This matters for the UX of RAGEval's evaluation interface.
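The interrupted-stream path can be exercised without a network. The sketch below is a standard-library approximation: `broken_stream` is a hypothetical provider stream that fails after one chunk, and `stream_generator` mirrors the endpoint's generator with the logging left out.

```python
import asyncio
from types import SimpleNamespace


async def broken_stream():
    """Hypothetical provider stream that fails after one chunk."""
    delta = SimpleNamespace(content="partial answer")
    yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    raise ConnectionResetError("network died mid-stream")


async def stream_generator(response):
    """Same shape as the endpoint's generator, minus the logging."""
    try:
        async for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                yield content
    except Exception:
        # The HTTP status line has already been sent by this point,
        # so the failure can only be signalled inside the body.
        yield "\n[Error: Stream Interrupted]"


async def read_all(gen) -> str:
    """Drain an async generator of strings into one string."""
    return "".join([piece async for piece in gen])


print(repr(asyncio.run(read_all(stream_generator(broken_stream())))))
# prints: 'partial answer\n[Error: Stream Interrupted]'
```

One consequence worth keeping in mind: the client still sees a 200 status in this case, so anything consuming the stream programmatically has to look for the marker in the text.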
What I Actually Built
In 2 days, I went from nothing to a production-grade foundation:
✅ Async FastAPI Server — Handles concurrent requests without blocking
✅ Structured Validation — Pydantic catches bad input before it reaches the LLM
✅ Comprehensive Error Handling — API errors, connection errors, unknown errors all handled
✅ Structured Logging — Every error logged with context for debugging
✅ Full Test Suite — 3 test files covering health, completion, and streaming
✅ Multi-Provider Support — Groq, OpenAI, Anthropic, local models, anything LiteLLM handles
✅ Streaming Responses — Real-time token generation with error recovery
This isn't a toy project. It's the foundation RAGEval will be built on.
How This Connects to RAGEval's Vision
Right now, I can call an LLM and stream responses. But RAGEval's full picture looks like this:
Phase 1 (Days 1-2 — Done):
- ✅ LLM completion endpoint with streaming
- ✅ Multi-provider support
- ✅ Error handling and logging
- ✅ Tests for the health, completion, and streaming paths
Phase 2 (Coming Next):
- Document ingestion (PDF parsing, smart chunking)
- Vector database integration (Qdrant)
- RAG query system (retrieve docs + generate answers)
- Evaluation metrics (faithfulness, relevance, precision, recall)
Phase 3:
- Experiment framework (A/B test configurations)
- Dataset management and evaluation results
- Comparison tables and visualization
Phase 4:
- Production features (authentication, rate limiting, observability)
- Web UI for non-developers
- Integration with popular RAG frameworks
The foundation I built in 2 days is what all of this rests on. It's the API layer. The streaming backbone. The error handling that keeps everything running when things break.
Most teams would have built Phase 2 first. I built the foundation that makes Phase 2 possible.
What I Learned Building This
The Small Decisions That Compound
Async/await from day 1 — One blocking I/O call scales into a bottleneck at 100 concurrent requests. I chose async from the start.
Testing before refactoring — When I realized I needed multi-provider support, the tests gave me confidence to refactor. Without them, I'd have spent hours debugging.
Error handling as architecture — Not an afterthought. Built in. This matters because RAGEval will be used to evaluate systems. If RAGEval itself is unreliable, its evaluations are meaningless.
LiteLLM upfront — I could have used just Groq. Using LiteLLM from day 1 meant that when RAGEval needs to compare Groq vs Claude vs OpenAI, the architecture already supports it.
What I'd Do Differently
Environment config: API keys scattered around. Next time, .env.example from line 1.
More granular error tests: I have the main tests, but could add specific tests for timeout scenarios, rate limiting, and auth failures.
The Journey Ahead
Standing here at the end of Day 2, I'm not thinking about the endpoint I just built. I'm thinking about the RAG evaluation platform.
I can see it:
- Teams uploading documents
- Creating test datasets of questions
- Running RAG experiments with different configs
- Seeing side-by-side comparisons of faithfulness, relevancy, cost, latency
- Finding the optimal embedding model, chunk size, top-K value
This foundation makes that possible. The next phase gets us closer.
The Code
The full codebase is on GitHub:
github.com/abuhurayraniloy/RAGEval
This is Day 1-2 work. Everything I described above. Ready for the next phase.
What's Next for Me
Tomorrow, I start on document ingestion. PDFs, text files, markdown. Smart chunking. Embedding generation. Getting documents into Qdrant.
The day after, RAG query system. The part where I actually retrieve documents and feed them to the LLM.
That's where RAGEval starts to come alive.
But none of that happens without the foundation I built in these 2 days. None of it.













