I kept seeing "OpenAI-compatible" stamped on projects that have nothing to do with OpenAI. Ollama. vLLM. LM Studio. Most of the local model tools I run on my own hardware. None of them are OpenAI, yet they all advertise the same compatibility badge. So I went looking for what that badge actually means, and the answer turned out to be more interesting than I expected.
There isn't one OpenAI API spec. There are two formats, they serve different purposes, and the one everyone copied is not the one OpenAI now tells you to use.
## Two formats, not one
OpenAI exposes two main ways to send text and multimodal requests to its models. The older one is the Chat Completions API. The newer one is the Responses API, introduced in March 2025 and now recommended for new projects.
This distinction matters because of a quiet mix-up I see all the time. When people say a tool is "OpenAI-compatible," they almost always mean Chat Completions. That's the format the rest of the industry cloned. The Responses API is the direction OpenAI is steering everyone toward, but it isn't the thing that became a standard. Knowing which one you're talking about saves a lot of confusion.
Let me walk through both, then get to why this happened and whether you should care.
## How Chat Completions works
Chat Completions models a request as a list of messages, where each message carries a role. The roles do the work:
- `developer` sets the persona and the rules. Older models called this `system`.
- `user` holds the human prompt.
- `assistant` holds the model's previous replies, which is how you replay conversation history.

A request looks like this:
{
  "model": "gpt-5.4-mini",
  "messages": [
    {
      "role": "developer",
      "content": "You are a helpful assistant that speaks like a 1920s detective."
    },
    {
      "role": "user",
      "content": "Where did I leave my keys?"
    }
  ],
  "temperature": 0.7
}
The response wraps the answer inside a choices array. The array exists because you can ask for more than one variation with the n parameter, so even a single reply comes back at index zero:
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-5.4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Listen here, pal. If I knew where your brass keys were hiding, I'd be buying juice, not cracking wise."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 34,
    "completion_tokens": 26,
    "total_tokens": 60
  }
}
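If it helps to see that shape handled in code, here is a minimal sketch that pulls the text out of every entry in choices. It only works on the parsed JSON body, and the helper name `reply_texts` is my own invention, not something from an SDK.

```python
import json


def reply_texts(body: dict) -> list[str]:
    """Return the assistant text of each choice, ordered by its index."""
    choices = sorted(body.get("choices", []), key=lambda c: c["index"])
    return [c["message"]["content"] for c in choices]


# A trimmed copy of the response shown above.
raw = """
{
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Listen here, pal."},
      "finish_reason": "stop"
    }
  ]
}
"""

body = json.loads(raw)
texts = reply_texts(body)
print(texts[0])
```

With n left at its default there is exactly one entry, so `texts[0]` is the whole answer.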
There's one catch worth knowing. Chat Completions is stateless, so the server doesn't remember your conversation. Every turn, you resend the full message history, including any tool outputs from earlier in the exchange. For a simple chatbot that's fine. For a multi-step agent that calls tools, it gets unwieldy fast.
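To make the statelessness concrete, here is a small sketch of the bookkeeping the client ends up doing. The `Conversation` class and `fake_chat_server` are hypothetical names of mine, and the fake server stands in for a real endpoint so the example runs without a network call.

```python
from typing import Callable

Message = dict[str, str]


def fake_chat_server(payload: dict) -> dict:
    """Stand-in for a Chat Completions endpoint: reports how many messages it saw."""
    seen = len(payload["messages"])
    return {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": f"I received {seen} messages."},
                "finish_reason": "stop",
            }
        ]
    }


class Conversation:
    """Keeps the transcript on the client, because the server keeps nothing."""

    def __init__(self, model: str, rules: str, send: Callable[[dict], dict]):
        self.model = model
        self.send = send
        self.messages: list[Message] = [{"role": "developer", "content": rules}]

    def ask(self, prompt: str) -> str:
        self.messages.append({"role": "user", "content": prompt})
        # The whole history goes out on every call, not just the new prompt.
        body = self.send({"model": self.model, "messages": list(self.messages)})
        reply = body["choices"][0]["message"]
        self.messages.append(reply)
        return reply["content"]


chat = Conversation("gpt-5.4-mini", "You are a helpful assistant.", fake_chat_server)
print(chat.ask("Where did I leave my keys?"))  # payload carried 2 messages
print(chat.ask("Did you check the couch?"))    # payload carried 4 messages
```

The payload grows by two messages per turn here. Add tool calls and their outputs to that list and you can see why agents strain the format.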
## Why Chat Completions became the standard
OpenAI shipped this format in 2023, it worked, and it arrived early. That timing matters more than any technical merit.
Once enough tutorials and production apps were written against the messages and choices shape, the format stopped belonging to OpenAI in any practical sense. Other model providers added compatibility so developers could swap them in without rewriting code. Local inference engines like vLLM and Ollama did the same. So did gateways like OpenRouter, whose whole job is normalizing access to many models behind one interface. At that point, supporting the format wasn't a favor to OpenAI. It was table stakes for anyone who wanted developers to adopt their thing.
This is a classic network effect. The format won because it was everywhere, and it stayed everywhere because it won. Simon Willison flagged the obvious risk back in 2025: a whole industry was building clones of one company's proprietary API, and that company could change it whenever it liked.
## How the Responses API differs
So OpenAI changed it. Sort of.
The Responses API is a redesign aimed at agents rather than chatbots. It separates your standing instructions from the actual input, and it can hold conversation state on the server so you stop resending the whole transcript.
A request is leaner:
{
  "model": "gpt-5.5",
  "instructions": "You are a concise data analyst.",
  "input": "Summarize our Q2 performance."
}
The instructions field carries the system-level guidance. You pass the actual prompt through input, either as a plain string or a list of messages. If you want the server to remember the last turn, you send store: true and then pass a previous_response_id on the next call instead of replaying everything.
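Here is a rough sketch of how two chained requests might be built. The field names come from the paragraph above, while `build_response_request` and the response id are placeholders I made up. I resend instructions on the follow-up because, as I understand it, they apply per request and are not inherited from the stored turn.

```python
from typing import Optional


def build_response_request(
    model: str,
    user_input: str,
    instructions: Optional[str] = None,
    previous_response_id: Optional[str] = None,
) -> dict:
    """Build a Responses API request body as a plain dict."""
    payload = {"model": model, "input": user_input, "store": True}
    if instructions is not None:
        payload["instructions"] = instructions
    if previous_response_id is not None:
        # Points the server at the stored turn instead of replaying it.
        payload["previous_response_id"] = previous_response_id
    return payload


rules = "You are a concise data analyst."
first = build_response_request("gpt-5.5", "Summarize our Q2 performance.", rules)

# Pretend the server answered the first call with this id (placeholder value).
first_id = "resp_example_123"

follow_up = build_response_request(
    "gpt-5.5", "Now compare it to Q1.", rules, previous_response_id=first_id
)
print(follow_up)
```

Notice what is missing from the follow-up: the first question and its answer. The id stands in for both.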
The response is where I see the most misinformation, so here's the real shape. The output is not a single top-level string. It's a typed output array of items, because a single response can now contain a message, a tool call, a reasoning trace, and more, all on one timeline:
{
  "id": "resp_9z8x7c...",
  "object": "response",
  "created_at": 1782384000,
  "model": "gpt-5.5",
  "status": "completed",
  "output": [
    {
      "type": "message",
      "id": "msg_...",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Q2 revenue rose 14% quarter over quarter, driven mostly by enterprise software renewals.",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 28,
    "output_tokens": 16,
    "total_tokens": 44
  }
}
You'll see response.output_text in a lot of examples, and it looks like a top-level field. It isn't. It's a convenience helper in the SDK that digs into the output array and pulls out the text for you. Handy, but don't expect it in the raw JSON. Also note the token counts renamed themselves: Chat Completions reports prompt_tokens and completion_tokens, while Responses uses input_tokens and output_tokens. Small thing, easy to trip on.
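If you are working with the raw JSON, both points are easy to handle yourself. The sketch below is my approximation of what such a helper does, not the SDK's actual code, and `normalize_usage` is a hypothetical name for smoothing over the renamed counters.

```python
def output_text(response: dict) -> str:
    """Concatenate every output_text part found in message items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip tool calls, reasoning items, and so on
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)


def normalize_usage(usage: dict) -> dict:
    """Map either API's token counters onto one pair of names."""
    return {
        "input": usage.get("input_tokens", usage.get("prompt_tokens", 0)),
        "output": usage.get("output_tokens", usage.get("completion_tokens", 0)),
    }


# A trimmed copy of the Responses payload shown above.
response = {
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Q2 revenue rose 14%.", "annotations": []}],
        }
    ],
    "usage": {"input_tokens": 28, "output_tokens": 16, "total_tokens": 44},
}

print(output_text(response))
print(normalize_usage(response["usage"]))
print(normalize_usage({"prompt_tokens": 34, "completion_tokens": 26}))
```

Filtering on the item type matters once tools or reasoning are involved, because the output array then holds entries that carry no text at all.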
OpenAI's pitch for Responses is concrete. Better cache utilization cuts cost on multi-turn workloads. Reasoning models score higher because the API preserves their reasoning context between turns. Built-in tools like web search and code execution save you from wiring up your own function-calling loop. For agent work, those are real wins.
There's a deeper reason the stateful design matters, and it's easy to miss. Reasoning models generate a hidden chain of thought before they answer. In a stateless setup, the client has to send the whole history back every turn. That forces an awkward choice. You either strip the reasoning out and lose the model's train of thought, or ship it back and forth as encrypted blocks. Holding state on the server avoids both. OpenAI keeps the reasoning trace on its own backend from one turn to the next, so the model stays sharp without exposing how it got there. That's one of the real reasons behind the push.
## Open Responses changes the question
Here's the part that made this worth a blog post. In January 2026, OpenAI and a group of partners published Open Responses, an open-source specification built on the Responses API.
The launch partners are telling. Hugging Face, Vercel, OpenRouter, LM Studio, Ollama, and vLLM all signed on. These are the same tools that cloned Chat Completions on their own. This time, instead of reverse-engineering a proprietary format and hoping it doesn't shift, they helped write a documented spec with formal acceptance tests and a shared schema. The vLLM team said as much: they used to guess at provider behavior, and a real spec ends that.
The idea is one schema you describe requests and outputs against once, then run across OpenAI, local models, or other providers with minimal translation. Notably absent from the launch lineup were Anthropic and Google DeepMind, which keep their own formats. Even so, the spec lists both as targets it aims to reach through adapter layers, so the plan is to cover them, not route around them. This isn't a universal peace treaty, but it's the open part of the field agreeing on a baseline.
There's an irony nobody is hiding. Building an "open" standard on top of one company's API is a strange way to escape that company's gravity. But for anyone tired of writing a wrapper around a wrapper, a documented, testable spec beats a de facto one that lives at OpenAI's discretion.
## A few parameters worth knowing
Whichever format you call, a handful of payload options control behavior:
- `temperature` sets randomness. Push it to 0.9 for creative output, drop it to 0.2 for focused, near-deterministic answers.
- `stream` set to `true` returns text token by token over Server-Sent Events instead of making you wait for the full reply. The streaming events differ between the two formats, which is one more reason adapters need a spec.
- Structured output is handled differently in each. Chat Completions uses `response_format` with a JSON schema. Responses uses `text.format`. Both can force valid JSON out of the model.
- For length limits, watch the naming. Chat Completions deprecated `max_tokens` in favor of `max_completion_tokens`. Responses uses `max_output_tokens`. Same intent, three different names, depending on where you are.

## Where I landed
The practical takeaway, especially if you self-host like I do, is that compatibility is the whole game. Because Ollama and vLLM speak the OpenAI formats, I can point a tool at a local model and a frontier model with the same code and only a base URL between them. That portability is worth real money and real freedom, and it's exactly what Open Responses is trying to protect going forward.
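Here is roughly what "only a base URL between them" looks like, using nothing but the standard library. The local URLs are the defaults I'd expect from a stock Ollama or vLLM install, and the model name `llama3.2` and the API keys are placeholders, so treat all of them as assumptions to adjust. The sketch builds the requests without sending them.

```python
import json
import urllib.request

# Assumed defaults; change these to match your own setup.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a Chat Completions POST; only base_url and model vary per backend."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


local = build_chat_request(BASE_URLS["ollama"], "placeholder-key", "llama3.2", "Hello")
hosted = build_chat_request(BASE_URLS["openai"], "sk-placeholder", "gpt-5.4-mini", "Hello")

print(local.full_url)
print(hosted.full_url)
# To actually send one: urllib.request.urlopen(local), then json-decode the body.
```

Everything after the base URL is identical, which is the whole point.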
If you're starting something new, build against the Responses API and lean on the Open Responses spec where you can. If you're maintaining an existing app, Chat Completions isn't going anywhere. OpenAI has committed to supporting it indefinitely, and the rest of the field still runs on it. Either way, knowing which spec you're actually talking about is half the battle.
Further reading: OpenAI's migration guide, the Open Responses specification, and Simon Willison's original take on the standardization risk.













