Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing

Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves?

This post walks through a small but production-shaped LangGraph deployment: a RAG ReAct agent that

exposes an OpenAI-compatible HTTP API, so any OpenAI client (Open WebUI, the openai SDK, LibreChat) can talk to it unchanged,
routes every model call through a gateway so switching from a hosted API to self-hosted vLLM is a config change, not a code change, and
gets full tracing — node transitions, tool calls, and LLM calls in one trace — by adding a single callback.

Every snippet below is real code from a working service. Roughly 150 lines of Python is all it takes.

The shape of the thing

OpenAI client (Open WebUI, openai SDK)
        │  POST /v1/chat/completions
        ▼
FastAPI router ──► LangGraph StateGraph ──► LLM Gateway ──► model (hosted API today, vLLM tomorrow)
        │                   │
        │                   └──► ToolNode ──► Qdrant (RAG)
        │
        └──► Langfuse callback (one trace per request)

The contract with the outside world is just the OpenAI API. Everything interesting — the graph, RAG, tracing — lives behind that boundary. That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code.

1. The ReAct graph

The graph is deliberately tiny: one agent node that reasons, one tools node that retrieves, and a conditional edge that loops between them until the model stops asking for tools.

# app/graph/builder.py
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

def build_graph():
    g = StateGraph(AgentState)
    g.add_node("agent", agent_node)
    g.set_entry_point("agent")

    # ReAct: if the model emits tool_calls, go to `tools`; otherwise END.
    g.add_node("tools", ToolNode(TOOLS))
    g.add_conditional_edges("agent", tools_condition)
    g.add_edge("tools", "agent")
    return g.compile()

tools_condition and ToolNode are LangGraph prebuilts that do the unglamorous work: inspect the last message for tool_calls, route accordingly, execute the tools, and append ToolMessages back into state. You wire the loop; they run it.

State is a single shared message log with a reducer that appends rather than replaces:

# app/graph/state.py
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict, total=False):
    messages: Annotated[list[BaseMessage], add_messages]

add_messages is the reducer. Every node returns {"messages": [...]} and LangGraph merges it into the running log — no manual list-shuffling, and it's what makes the agent⇄tools loop accumulate context correctly.

The agent node binds the tools and calls the model. Note bind_tools is conditional — flip RAG off and the exact same node degrades to a plain single-shot chat call:

# app/graph/nodes/agent.py
async def agent_node(state: AgentState) -> dict:
    llm = get_llm()
    if get_settings().rag_enabled:
        llm = llm.bind_tools(TOOLS)
    messages = [SystemMessage(content=SYSTEM_PROMPT), *state["messages"]]
    response = await llm.ainvoke(messages)
    return {"messages": [response]}

And the tool itself is an ordinary @tool-decorated function. The docstring is not documentation — it's the prompt the model reads to decide when to call it:

# app/graph/tools.py
@tool
def search_docs(query: str) -> str:
    """Search internal docs for content relevant to the question.
    When the user asks about the project/system/docs, call this first."""
    hits = get_vector_store().similarity_search(query, k=get_settings().rag_top_k)
    blocks = [
        f"[{i}] (source: {doc.metadata.get('source', 'unknown')})\n{doc.page_content.strip()}"
        for i, doc in enumerate(hits, 1)
    ]
    return "\n\n".join(blocks) or "No relevant documents found."

Returning a [1] (source: ...) structure isn't cosmetic — it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust.

2. The OpenAI-compatible surface

Here's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. The router turns an incoming /v1/chat/completions request into graph input and the graph's output back into an OpenAI response.

# app/api/router.py
@router.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    graph = get_graph()
    inputs = {"messages": to_langchain_messages(req.messages)}
    config: dict = {}

    if not req.stream:
        result = await graph.ainvoke(inputs, config=config)
        text = extract_final_text(result.get("messages", []))
        return make_completion(text, settings.served_model_name)

    return StreamingResponse(
        graph_to_openai_sse(graph, inputs, settings.served_model_name, config=config),
        media_type="text/event-stream",
    )

Because the response matches OpenAI's schema (including SSE streaming chunks), Open WebUI thinks it's talking to OpenAI. You point its openaiBaseUrl at this service and your custom RAG agent shows up as a selectable model. No frontend work.

3. One gateway, many models

LangGraph nodes never name a provider. They call one factory:

# app/llm/client.py
from langchain_openai import ChatOpenAI

def get_llm(model=None, temperature=None, streaming=True) -> ChatOpenAI:
    s = get_settings()
    return ChatOpenAI(
        base_url=f"{s.litellm_url}/v1",   # gateway, not a provider
        api_key=s.litellm_key,
        model=model or s.default_model,
        temperature=s.default_temperature if temperature is None else temperature,
        streaming=streaming,
    )

The base_url points at a LiteLLM gateway, not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its model_list says — a hosted API today, self-hosted vLLM tomorrow. Migrating off a paid API to an in-cluster GPU model becomes a gateway config edit; this Python file never changes.

There's one deliberate escape hatch — when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint:

    if s.chat_provider.lower() == "ollama":
        return ChatOpenAI(base_url=f"{s.ollama_url}/v1", api_key="ollama",
                          model=model or s.ollama_chat_model, ...)

Same ChatOpenAI class, different base_url. The OpenAI-compatible interface shows up three times in this architecture — inbound API, gateway, and local fallback — and that consistency is the whole trick.

4. Tracing in one line

A multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run — every node transition, tool call, and LLM call — as a single nested trace.

The integration is genuinely one object:

# app/obs/langfuse.py
from functools import lru_cache

@lru_cache
def get_langfuse_handler():
    s = get_settings()
    if not (s.langfuse_public_key and s.langfuse_secret_key):
        return None  # no keys → tracing silently disabled (safe for local/POC)
    from langfuse.langchain import CallbackHandler
    return CallbackHandler()

Heads-up for the SDK version churn: on Langfuse SDK v3+ the import is from langfuse.langchain import CallbackHandler, and the handler reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment — you don't pass keys to the constructor anymore. This tripped up a lot of v2 tutorials.

Then attach it per request via the graph config — which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI:

# app/api/router.py
handler = get_langfuse_handler()
if handler is not None:
    config["callbacks"] = [handler]
    config["metadata"] = {
        "langfuse_user_id": req.user or "anonymous",
        "langfuse_session_id": getattr(req, "chat_id", None) or "no-session",
        "langfuse_tags": ["my-agent", settings.served_model_name],
    }

Passing the handler through config["callbacks"] (rather than baking it into the LLM client) means it propagates down the entire graph automatically. One request → one trace → every step visible.

What this buys you

Concern	How it's handled	Why it scales
Frontend integration	OpenAI-compatible API	Any OpenAI client works unchanged
Model choice	LiteLLM gateway behind `ChatOpenAI`	Swap providers via config, not code
Agent logic	LangGraph `StateGraph` + prebuilts	ReAct loop in ~10 lines, extensible to multi-agent
Observability	Langfuse callback via graph `config`	One trace per request, zero per-node wiring
Local dev	Ollama fallback through same interface	No gateway needed to hack offline

None of these pieces is exotic. The point is the seams: an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable.

The same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model — each is an additive change behind one of those seams. But that's a follow-up post.

Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.