The Problem That Started Everything

I was building a RAG system and realized something terrifying: I had no idea if it was actually working.

The LLM would confidently cite information that wasn't in the retrieved documents. We had passing tests. The API was fast. But something was fundamentally broken, and we couldn't see it.

I asked my team: "Is our retrieval actually good? Which embedding model performs better? Does reranking help? Why did this fail?"

Nobody could answer. We had no visibility.

That moment sparked an idea: RAGEval—a platform for measuring, debugging, and optimizing RAG systems. A place where teams could:

Upload documents
Create evaluation datasets
Run RAG experiments
Compare configurations side-by-side
Measure retrieval quality, answer faithfulness, cost, latency
See exactly why things fail

But before I could build any of that, I needed a foundation. I needed a production-ready API that could:

Call LLMs reliably
Handle errors gracefully
Stream responses in real-time
Support multiple LLM providers (not locked into one)
Be fully tested

This is the story of building that foundation in 2 days. And how it became the heartbeat of RAGEval.

Day 1: The Foundation Takes Shape

I started with a blank canvas. No code, no git history. Just the vision of RAGEval and the need to prove it could actually work.

Why a Solid Foundation Matters

Most projects fail not because the idea is bad, but because the foundation is rickety. I wasn't going to make that mistake. Before writing a single feature for RAGEval, I needed something rock-solid underneath.

# Create the project
uv init rageval
cd rageval

# Add the core dependencies we'll need
uv add fastapi uvicorn python-dotenv pydantic litellm groq
uv add --dev pytest pytest-asyncio httpx

I chose uv because it's fast; dependency management shouldn't be a bottleneck. I chose FastAPI because it's built for async work, which is critical when you're calling external APIs. I chose LiteLLM upfront, even though I'd only use Groq initially, because RAGEval needs to support any LLM provider. This is intentional architecture, not accidental.

The Health Check (You Always Need This)

# src/main.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"Message": "Hello from the root"}

Simple. Boring. Essential. Every production API needs a health check. Load balancers need it. Kubernetes needs it. Monitoring needs it.

Push to GitHub. Commit message: "Initial setup".

The Real Work: Building the LLM Completion Endpoint

Now I faced the actual challenge. RAGEval would eventually need to:

Retrieve documents from a vector database
Pass them to an LLM with a question
Generate an answer
Stream it back to the user
Handle everything that could go wrong

For now, I'd build just the LLM part. The streaming, async foundation that everything else would rest on.

from fastapi import FastAPI, HTTPException, status
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from litellm import acompletion
from litellm.exceptions import APIError, APIConnectionError
from dotenv import load_dotenv
import logging

load_dotenv()
logger = logging.getLogger("uvicorn.error")

# Same app instance as in the health-check snippet above
app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    model: str = "groq/llama-3.3-70b-versatile"
    max_tokens: int = 500

@app.post("/complete")
async def request_llm(request: CompletionRequest):
    try:
        response = await acompletion(
            model=request.model,
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens,
            stream=True
        )

        async def stream_generator():
            try:
                async for chunk in response:
                    content = chunk.choices[0].delta.content
                    if content:
                        yield content
            except Exception as stream_err:
                logger.error(f"Stream interrupted: {str(stream_err)}")
                yield "\n[Error: Stream Interrupted]"

        return StreamingResponse(stream_generator(), media_type="text/plain")

    except APIError as api_err:
        logger.error(f"LLM API Error: {api_err.message}")
        raise HTTPException(
            status_code=api_err.status_code,
            detail=f"LLM API Error: {api_err.message}" 
        )

    except APIConnectionError as conn_err:
        logger.error(f"LLM Connection Error: {str(conn_err)}")
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail="Could not reach LLM."
        )

    except Exception as e:
        logger.error(f"Unexpected error: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="An unexpected internal server error occurred."
        )

I made deliberate choices here:

Async/await everywhere — Non-blocking. This matters when RAGEval eventually handles 100 concurrent evaluation requests.
Streaming built-in — RAGEval will need to stream evaluation results back to users.
Groq as default — Fast inference on an open-weight Llama model, which keeps costs down while testing.
Logging from day 1 — When RAGEval fails in production, I need to know why.
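
To make the async point concrete, here is a minimal sketch that uses only the standard library. `fake_llm_call` is a hypothetical stand-in for an awaited provider call (it only sleeps) and is not part of the RAGEval code. Five simulated calls that each wait 0.1 seconds should finish in roughly 0.1 seconds total rather than 0.5, because the event loop overlaps the waits.

```python
import asyncio
import time


async def fake_llm_call(prompt: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an awaited LLM request; it only sleeps."""
    await asyncio.sleep(delay)  # hands control back to the event loop while waiting
    return f"answer to: {prompt}"


async def run_concurrently(prompts: list[str]) -> tuple[list[str], float]:
    """Start every call at once and return the answers plus elapsed seconds."""
    start = time.perf_counter()
    answers = await asyncio.gather(*(fake_llm_call(p) for p in prompts))
    return list(answers), time.perf_counter() - start


answers, elapsed = asyncio.run(run_concurrently([f"q{i}" for i in range(5)]))
print(f"{len(answers)} answers in {elapsed:.2f}s")
```

A blocking call (for example `time.sleep`) in place of `asyncio.sleep` would serialize the five waits, which is the bottleneck the async choice is meant to avoid.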

Tested it:

curl -X POST http://localhost:8000/complete \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain what RAG is"}'

The response streamed back token by token. It worked.

End of Day 1. I had a working foundation. Committed. Pushed. Ready to build on it.

Day 2: Making It Production-Ready

Day 1 proved the concept. Day 2 was about making it bulletproof.

The Reality Check: Things Break

I walked through the code with fresh eyes:

What if Groq's API times out?
What if someone sends a malformed request?
What if the network dies mid-stream?
What if something completely unexpected happens?

Day 1 me had no answers. Day 2 me had error handling.

The code already had the right structure:

except APIError as api_err:
    logger.error(f"LLM API Error: {api_err.message}")
    raise HTTPException(
        status_code=api_err.status_code,
        detail=f"LLM API Error: {api_err.message}" 
    )

except APIConnectionError as conn_err:
    logger.error(f"LLM Connection Error: {str(conn_err)}")
    raise HTTPException(
        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        detail="Could not reach LLM."
    )

except Exception as e:
    logger.error(f"Unexpected error: {str(e)}")
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail="An unexpected internal server error occurred."
    )

Three distinct error types:

API errors — The LLM provider returned an error. We log it with context and tell the user.
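Connection errors — We can't reach the LLM provider: a network issue or the service being down. We respond with 503 (Service Unavailable). Authentication failures are a different case; LiteLLM raises a separate AuthenticationError for those, which this code does not catch explicitly.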
Everything else — Catch-all. We log it and tell the user something went wrong.

This isn't overcomplicated. It's realistic. Production systems fail. The question is whether you fail gracefully.
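
One way to keep this mapping easy to test is to pull it into a pure function. The sketch below is an illustration, not the code in the repository: `ProviderAPIError` and `ProviderConnectionError` are hypothetical stand-ins for LiteLLM's exception classes, and `to_http_error` is a made-up helper name. It mirrors the three branches above without needing FastAPI or a network.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for provider exceptions; the real handler catches
# LiteLLM's APIError and APIConnectionError instead.
class ProviderAPIError(Exception):
    def __init__(self, message: str, status_code: int):
        super().__init__(message)
        self.message = message
        self.status_code = status_code


class ProviderConnectionError(Exception):
    pass


@dataclass(frozen=True)
class HttpError:
    status_code: int
    detail: str


def to_http_error(exc: Exception) -> HttpError:
    """Translate an exception into the status/detail pair the endpoint returns."""
    if isinstance(exc, ProviderAPIError):
        # The provider answered with an error: pass its status code through.
        return HttpError(exc.status_code, f"LLM API Error: {exc.message}")
    if isinstance(exc, ProviderConnectionError):
        # The provider is unreachable: 503 Service Unavailable.
        return HttpError(503, "Could not reach LLM.")
    # Anything else: a generic 500 that leaks no internals.
    return HttpError(500, "An unexpected internal server error occurred.")


print(to_http_error(ProviderAPIError("rate limited", 429)))
print(to_http_error(ProviderConnectionError("dns failure")))
print(to_http_error(ValueError("boom")))
```

With the mapping isolated like this, each branch can be asserted in a plain unit test, without mocking the LLM call.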

Tests: The Confidence Layer

I could have skipped testing. That would have been a mistake.

# tests/test_api.py
import pytest
from httpx import ASGITransport, AsyncClient
from src.main import app

@pytest.mark.asyncio
async def test_root():
    transport = ASGITransport(app=app)
    async with AsyncClient(
        transport=transport,
        base_url="http://test"
    ) as client:
        response = await client.get("/")

    assert response.status_code == 200
    assert response.json() == {"Message": "Hello from the root"}

Basic. But essential. We're testing that the health endpoint works using httpx.ASGITransport, which calls the app in-process. Fast. No network calls.

Now the real test—completion:

# tests/test_completion.py
from unittest.mock import patch
import pytest
from httpx import ASGITransport, AsyncClient
from src.main import app

@pytest.mark.asyncio
async def test_completion():
    async def fake_response():
        yield type("Chunk", (), {
            "choices": [
                type("Choice", (), {
                    "delta": type("Delta", (), {
                        "content": "Hello from fake LLM"
                    })
                })
            ]
        })

    with patch("src.main.acompletion", return_value=fake_response()):
        transport = ASGITransport(app=app)
        async with AsyncClient(
            transport=transport,
            base_url="http://test"
        ) as client:
            response = await client.post(
                "/complete",
                json={"prompt": "hello"}
            )

    assert response.status_code == 200
    assert response.text == "Hello from fake LLM"

Key insight: We mock the LLM call. We don't call Groq. We test that our code correctly handles whatever the LLM returns. This is fast. This is cheap. This is reliable.
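
The nested `type(...)` calls work, but they are hard to read. A possible alternative, sketched here with the standard library's `types.SimpleNamespace`, builds the same `chunk.choices[0].delta.content` shape in one line. `make_chunk`, `fake_stream`, and `collect` are hypothetical helper names, not part of the repository.

```python
import asyncio
from types import SimpleNamespace


def make_chunk(content):
    """Build an object exposing chunk.choices[0].delta.content, as the endpoint reads it."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=content))]
    )


async def fake_stream(*pieces):
    """Async generator standing in for the result of a streaming LLM call."""
    for piece in pieces:
        yield make_chunk(piece)


async def collect(stream):
    """Join the text of every chunk, skipping empty or None content."""
    parts = []
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


result = asyncio.run(collect(fake_stream("Hello", None, " from fake LLM")))
print(result)
```

In the tests above, `fake_stream(...)` could be passed as the `return_value` of the patch in place of `fake_response()`.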

And streaming:

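# tests/test_streaming.py
from unittest.mock import patch
import pytest
from httpx import ASGITransport, AsyncClient
from src.main import app
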
@pytest.mark.asyncio
async def test_streaming():
    async def fake_response():
        for content in ("Explain", " AI"):
            yield type("Chunk", (), {
                "choices": [
                    type("Choice", (), {
                        "delta": type("Delta", (), {
                            "content": content
                        })
                    })
                ]
            })

    with patch("src.main.acompletion", return_value=fake_response()):
        transport = ASGITransport(app=app)
        async with AsyncClient(
            transport=transport,
            base_url="http://test"
        ) as client:
            response = await client.post(
                "/complete",
                json={"prompt": "Explain AI"}
            )

    assert response.status_code == 200
    assert response.text == "Explain AI"

This verifies that chunks are forwarded in order: two chunks ("Explain" and " AI") concatenate into one response. Note that response.text reads the whole body, so the test checks the assembled text rather than token-by-token delivery.

Run everything:

uv run pytest -v
# test_api.py::test_root PASSED
# test_completion.py::test_completion PASSED
# test_streaming.py::test_streaming PASSED

All green. Now I could refactor with confidence. The tests have my back.

Multi-Provider Support: The Hidden Architecture

Look at the CompletionRequest:

class CompletionRequest(BaseModel):
    prompt: str
    model: str = "groq/llama-3.3-70b-versatile"
    max_tokens: int = 500

The model parameter is configurable. Why does this matter?

RAGEval's entire value proposition is evaluation and comparison. Teams will want to test:

Does Claude perform better than Groq?
Is OpenAI's embedding model worth the cost?
Should we use a smaller local model?

I built this in upfront by choosing LiteLLM. Now RAGEval can support any LLM provider without changing the code:

# Groq (default, fast and cheap)
curl -X POST http://localhost:8000/complete \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain RAG", "model": "groq/llama-3.3-70b-versatile"}'

# OpenAI (expensive but capable)
curl -X POST http://localhost:8000/complete \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain RAG", "model": "gpt-4o-mini"}'

# Anthropic (different approach to reasoning)
curl -X POST http://localhost:8000/complete \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain RAG", "model": "claude-3-5-sonnet-20241022"}'

# Local model (no API costs)
curl -X POST http://localhost:8000/complete \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain RAG", "model": "ollama/mistral"}'

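All of them work from one codebase, provided the matching provider API key is set in the environment (or, for Ollama, a local server is running). That's the power of thinking ahead.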
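
Since comparison is the point, here is a rough sketch of what sending one prompt to several models might look like. It is an illustration under stated assumptions: `fake_complete` is a hypothetical stand-in for a real provider call, and `timed_call` and `compare_models` are made-up names that do not exist in the repository.

```python
import asyncio
import time
from typing import Awaitable, Callable

CompleteFn = Callable[[str, str], Awaitable[str]]


async def fake_complete(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a provider call; swap in a real client to use it."""
    await asyncio.sleep(0.01)
    return f"[{model}] {prompt}"


async def timed_call(complete: CompleteFn, model: str, prompt: str) -> dict:
    """Run one completion and record how long it took."""
    start = time.perf_counter()
    text = await complete(model, prompt)
    return {"model": model, "text": text, "seconds": time.perf_counter() - start}


async def compare_models(prompt, models, complete=fake_complete):
    """Send one prompt to every model concurrently; results keep the input order."""
    return await asyncio.gather(*(timed_call(complete, m, prompt) for m in models))


models = ["groq/llama-3.3-70b-versatile", "gpt-4o-mini", "ollama/mistral"]
results = asyncio.run(compare_models("Explain RAG", models))
for row in results:
    print(row["model"], f'{row["seconds"]:.3f}s')
```

Injecting the completion function keeps the comparison logic testable without any API key.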

Streaming Already Works

The implementation handles streaming correctly:

async def stream_generator():
    try:
        async for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                yield content
    except Exception as stream_err:
        logger.error(f"Stream interrupted: {str(stream_err)}")
        yield "\n[Error: Stream Interrupted]"

return StreamingResponse(stream_generator(), media_type="text/plain")

Users get tokens in real-time. If the stream breaks, they see [Error: Stream Interrupted] instead of silence. This matters for the UX of RAGEval's evaluation interface.
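
The mid-stream failure path can be exercised without a network. The sketch below re-implements the generator's logic in a standalone function; `stream_text` and `broken_stream` are hypothetical names, and the fake upstream raises after two chunks.

```python
import asyncio
from types import SimpleNamespace

ERROR_MARKER = "\n[Error: Stream Interrupted]"


def chunk(content):
    """Minimal object exposing choices[0].delta.content."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=content))]
    )


async def broken_stream():
    """Hypothetical upstream that fails after two chunks."""
    yield chunk("Explain")
    yield chunk(" AI")
    raise ConnectionError("network dropped")


async def stream_text(response):
    """Same shape as stream_generator: yield text, append a marker on failure."""
    try:
        async for item in response:
            content = item.choices[0].delta.content
            if content:
                yield content
    except Exception:
        yield ERROR_MARKER


async def main():
    return [piece async for piece in stream_text(broken_stream())]


pieces = asyncio.run(main())
print(pieces)
```

One design consequence worth keeping in mind: by the time the marker is sent, the 200 status line has already gone out, so a client has to inspect the body to notice the failure.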

What I Actually Built

In 2 days, I went from nothing to a production-grade foundation:

✅ Async FastAPI Server — Handles concurrent requests without blocking

✅ Structured Validation — Pydantic catches bad input before it reaches the LLM

✅ Comprehensive Error Handling — API errors, connection errors, unknown errors all handled

✅ Error Logging — Every error logged with context for debugging

✅ Full Test Suite — 3 test files covering health, completion, and streaming

✅ Multi-Provider Support — Groq, OpenAI, Anthropic, local models, anything LiteLLM handles

✅ Streaming Responses — Real-time token generation with error recovery

This isn't a toy project. It's the foundation RAGEval will be built on.

How This Connects to RAGEval's Vision

Right now, I can call an LLM and stream responses. But RAGEval's full picture looks like this:

Phase 1 (Days 1-2 — Done):

✅ LLM completion endpoint with streaming
✅ Multi-provider support
✅ Error handling and logging
✅ Tests for the health, completion, and streaming paths

Phase 2 (Coming Next):

Document ingestion (PDF parsing, smart chunking)
Vector database integration (Qdrant)
RAG query system (retrieve docs + generate answers)
Evaluation metrics (faithfulness, relevance, precision, recall)
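
As a preview of the retrieval metrics, here is a minimal sketch of precision and recall at k for a single query. It assumes document IDs are unique strings and that a set of relevant IDs is known per question; the function name and sample IDs are made up, and the eventual RAGEval implementation may differ.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of document IDs returned by the retriever
    relevant:  set of document IDs judged relevant for the question
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divides by the number of results actually returned (at most k).
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(p, r)
```

Here two of the top three results are relevant and two of the three relevant documents were found, so both values come out to 2/3.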

Phase 3:

Experiment framework (A/B test configurations)
Dataset management and evaluation results
Comparison tables and visualization

Phase 4:

Production features (authentication, rate limiting, observability)
Web UI for non-developers
Integration with popular RAG frameworks

The foundation I built in 2 days is what all of this rests on. It's the API layer. The streaming backbone. The error handling that keeps everything running when things break.

Most teams would have built Phase 2 first. I built the foundation that makes Phase 2 possible.

What I Learned Building This

The Small Decisions That Compound

Async/await from day 1 — One blocking I/O call scales into a bottleneck at 100 concurrent requests. I chose async from the start.
Testing before refactoring — When I realized I needed multi-provider support, the tests gave me confidence to refactor. Without them, I'd have spent hours debugging.
Error handling as architecture — Not an afterthought. Built in. This matters because RAGEval will be used to evaluate systems. If RAGEval itself is unreliable, its evaluations are meaningless.
LiteLLM upfront — I could have used just Groq. Using LiteLLM from day 1 meant that when RAGEval needs to compare Groq vs Claude vs OpenAI, the architecture already supports it.

What I'd Do Differently

Environment config — API keys scattered around. Next time, .env.example from line 1.
More granular error tests — I have the main tests, but could add specific tests for timeout scenarios, rate limiting, auth failures.
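
For the environment config point, a small startup check could complement a .env.example file. This is a sketch under assumptions: the mapping below uses the conventional variable names for each provider, and `REQUIRED_KEYS` and `missing_keys` are hypothetical names, not part of the repository.

```python
import os

# Assumed mapping from provider name to the API key variable it needs.
# Adjust to the providers actually in use.
REQUIRED_KEYS = {
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


# Example with an explicit dict so the result does not depend on the machine.
example_env = {"GROQ_API_KEY": "gsk-placeholder", "OPENAI_API_KEY": ""}
print(missing_keys(["groq", "openai", "anthropic"], env=example_env))
```

Calling `missing_keys(...)` at startup and failing fast would surface a missing key before the first request, instead of as a provider error at request time.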

The Journey Ahead

Standing here at the end of Day 2, I'm not thinking about the endpoint I just built. I'm thinking about the RAG evaluation platform.

I can see it:

Teams uploading documents
Creating test datasets of questions
Running RAG experiments with different configs
Seeing side-by-side comparisons of faithfulness, relevancy, cost, latency
Finding the optimal embedding model, chunk size, top-K value

This foundation makes that possible. The next phase gets us closer.

The Code

The full codebase is on GitHub:

github.com/abuhurayraniloy/RAGEval

This is Day 1-2 work. Everything I described above. Ready for the next phase.

What's Next for Me

Tomorrow, I start on document ingestion. PDFs, text files, markdown. Smart chunking. Embedding generation. Getting documents into Qdrant.

The day after, RAG query system. The part where I actually retrieve documents and feed them to the LLM.

That's where RAGEval starts to come alive.

But none of that happens without the foundation I built in these 2 days. None of it.

Building RAGEval: My Journey from Problem to Production Foundation in 2 Days
