Building a production chatbot requires balancing latency, context management, and inference cost. Most providers bill by the token, which means every system prompt, retrieved document, and prior turn increases the price of the next request. Oxlo.ai is a developer-first inference platform that uses flat, request-based pricing. Each API call costs the same regardless of prompt length, so multi-turn conversations and long-context retrieval do not inflate your bill. This guide covers the architecture, model selection, and implementation patterns for building conversational agents on Oxlo.ai.
Architecture of a Modern LLM Chatbot
A reliable chatbot has three layers: a state manager for conversation history, a retrieval or tool layer for external data, and an inference backend that generates responses. The inference layer should support streaming, function calling, and high context limits without penalizing you for using them. Oxlo.ai provides all of these capabilities through an API that is fully OpenAI SDK compatible, so you can drop it into existing architectures without rewriting client code.
Selecting a Model for Conversation
Oxlo.ai hosts more than 45 open-source and proprietary models across seven categories. For general-purpose dialogue, Llama 3.3 70B is a strong default. If your users speak multiple languages or you are building agent workflows, Qwen 3 32B offers multilingual reasoning. For coding assistants or technical support bots, DeepSeek V3.2 provides solid reasoning and is available on the free tier. When you need advanced chain-of-thought reasoning, Kimi K2.5 or Kimi K2 Thinking are good options, while Kimi K2.6 adds vision and a 131K context window for multimodal chat. You can switch between these models by changing a single parameter in your request, with no cold starts on popular models.
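Because switching models is a single-parameter change, one way to keep the choice in one place is a small lookup table keyed by use case. The sketch below is illustrative only: `MODEL_BY_USE_CASE` and `pick_model` are hypothetical names, and the model IDs are the ones that appear in this guide's request examples, so confirm them against the platform's model list before relying on them.

```python
# Hypothetical routing table: use case -> model ID (IDs taken from this guide's examples).
MODEL_BY_USE_CASE = {
    "general": "llama-3.3-70b",
    "multilingual": "qwen3-32b",
    "agent": "qwen3-32b",
    "coding": "deepseek-v3.2",
}


def pick_model(use_case: str) -> str:
    """Return the model ID for a use case, falling back to the general default."""
    return MODEL_BY_USE_CASE.get(use_case, MODEL_BY_USE_CASE["general"])
```

The returned string can be passed as the `model` argument of a chat completion request.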
SDK Setup and Authentication
Because Oxlo.ai is fully OpenAI SDK compatible, you can use the official Python or Node.js client. Set the base URL to https://api.oxlo.ai/v1 and pass your Oxlo.ai API key.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I design a conversation state manager?"},
    ],
)
print(response.choices[0].message.content)
Managing Conversation State and Context
Chatbots need to maintain message history. The simplest pattern is to append each turn to a messages list and send the full array to the API. For long sessions, you may need to truncate or summarize history to stay within model context limits. Unlike token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale, Oxlo.ai does not charge more when your message array grows. Request-based pricing means a conversation with ten prior turns costs the same as a single-turn prompt, provided you send them in one request. This makes Oxlo.ai significantly cheaper for long-context and agentic workloads.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."}
]

def chat(user_input):
    # Record the user's turn, then send the full history.
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=messages,
        stream=True,
    )
    reply = ""
    for chunk in response:
        # Some chunks may carry no choices or no text; skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            reply += chunk.choices[0].delta.content
    # Store the assistant's turn so the next request includes it.
    messages.append({"role": "assistant", "content": reply})
    return reply
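For the truncation case mentioned above, a minimal sketch might keep the system prompt and only the most recent messages. The helper name `truncate_history` and the `max_messages` default are assumptions for illustration, and message count is only a rough stand-in for the model's real context limit, which is measured in tokens.

```python
def truncate_history(messages, max_messages=10):
    """Keep all system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    # Drop leading replies so the kept window starts with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent
```

Calling `truncate_history(messages)` just before each request leaves the stored history untouched while bounding what is sent. Summarizing older turns instead of dropping them is an alternative when earlier context still matters.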
Adding Tools with Function Calling
Function calling lets your chatbot interact with external APIs, databases, or calculators. Oxlo.ai supports tool use across its chat models. Describe each tool's parameters as a JSON Schema object, then handle the model's tool calls in your application logic.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="qwen3-32b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
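The request above only declares the tool; the application still has to run whatever the model asks for. The following is a rough sketch of that step, assuming the response follows the OpenAI tool-call shape (`id`, `function.name`, and `function.arguments` as a JSON string). `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, and `get_weather` is a stub rather than a real weather lookup.

```python
import json


def get_weather(location):
    # Stub for illustration; a real implementation would call a weather service.
    return json.dumps({"location": location, "temperature": "22°C"})


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return the matching `tool` role messages."""
    results = []
    for call in tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            content = json.dumps({"error": f"unknown tool: {call.function.name}"})
        else:
            try:
                args = json.loads(call.function.arguments or "{}")
                content = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or unexpected arguments: report back to the model.
                content = json.dumps({"error": str(exc)})
        results.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return results
```

With a non-streaming response, the usual pattern is to append `response.choices[0].message` to `messages` when it contains `tool_calls`, extend `messages` with the output of `run_tool_calls`, and send a second request so the model can phrase the final answer.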
Streaming for Real-Time UX
Users expect chatbots to type back in real time. Oxlo.ai supports streaming responses via Server-Sent Events. Enable streaming by setting stream=True in your request and iterate over chunks as they arrive.
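response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=messages,
    stream=True,
)
for chunk in response:
    # Skip chunks that carry no choices (for example, a trailing metadata chunk).
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)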
Handling Image Input with Vision Models
For multimodal chatbots, Oxlo.ai offers vision models such as Gemma 3 27B and Kimi VL A3B. You can pass image URLs or base64-encoded images directly in the user message using the standard OpenAI message format.
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this diagram."},
        {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}}
    ]}
]

response = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=messages,
)
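The example above passes an image URL. For the base64 case, the standard OpenAI message format embeds the image as a data URL in the same `image_url` field. The helper below is a small sketch of that encoding; `image_message` is a hypothetical name, and the default MIME type is an assumption you should match to your actual file.

```python
import base64


def image_message(text, image_bytes, mime_type="image/png"):
    """Build a user message with text plus an inline base64 image (data URL)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }
```

Reading a local file with `open(path, "rb").read()` and passing the bytes to `image_message` yields a message that can be placed directly in the `messages` list.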
Cost Predictability at Scale
Token-based billing makes cost forecasting difficult for chatbots because user behavior determines input length. Oxlo.ai replaces token math with flat per-request pricing. Whether you send a one-line greeting or a full conversation history with retrieved documents, the cost is the same. Request-based pricing can be 10-100x cheaper than token-based pricing for long-context workloads. Oxlo.ai offers a free plan with 60 requests per day and access to more than 16 models, including DeepSeek V3.2. Paid plans include Pro at $80 per month for 1,000 daily requests and Premium at $350 per month for 5,000 daily requests with priority queue access. Enterprise tiers add custom unlimited volume and dedicated GPUs. See the pricing page for current plan details.
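To make the plan figures concrete, the arithmetic below works out the effective cost per request when a plan's daily quota is fully used, and the prompt length at which a token-billed request would cost the same. It is a back-of-the-envelope sketch: the 30-day month and the token price passed to `token_cost` and `break_even_tokens` are assumptions for illustration, not quoted rates from any provider.

```python
DAYS_PER_MONTH = 30  # assumption for illustration


def per_request_cost(monthly_price, daily_requests, days=DAYS_PER_MONTH):
    """Effective cost per request when the daily quota is fully used."""
    return monthly_price / (daily_requests * days)


def token_cost(prompt_tokens, price_per_million_tokens):
    """Input cost of one request under token-based billing."""
    return prompt_tokens / 1_000_000 * price_per_million_tokens


def break_even_tokens(request_cost, price_per_million_tokens):
    """Prompt length at which token billing matches the flat per-request cost."""
    return request_cost / price_per_million_tokens * 1_000_000


pro = per_request_cost(80, 1_000)       # about $0.0027 per request
premium = per_request_cost(350, 5_000)  # about $0.0023 per request
```

Unused quota raises the effective per-request cost, so the comparison is most favorable when daily volume is close to the plan limit and prompts are long.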
Putting It All Together
Below is a minimal but complete chatbot client that maintains state, streams output, and handles tool calls. It uses Oxlo.ai as the backend and works with any of the platform's chat models.
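from openai import OpenAI
import json

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def get_weather(location):
    # Stub implementation; replace with a real weather lookup.
    return json.dumps({"location": location, "temperature": "22°C"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user = input("User: ")
    if user.lower() in ["exit", "quit"]:
        break
    messages.append({"role": "user", "content": user})
    # Keep requesting until the model answers in text instead of calling a tool.
    while True:
        response = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=messages,
            tools=tools,
            stream=True,
        )
        reply = ""
        calls = {}  # tool-call index -> accumulated id, name, and arguments
        for chunk in response:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            if delta.content:
                reply += delta.content
                print(delta.content, end="", flush=True)
            # Tool calls arrive in fragments that share an index.
            for tc in delta.tool_calls or []:
                call = calls.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
                if tc.id:
                    call["id"] = tc.id
                if tc.function and tc.function.name:
                    call["name"] = tc.function.name
                if tc.function and tc.function.arguments:
                    call["arguments"] += tc.function.arguments
        if not calls:
            break
        # Record the assistant's tool request, then one result message per call.
        messages.append({
            "role": "assistant",
            "content": reply or None,
            "tool_calls": [
                {"id": c["id"], "type": "function",
                 "function": {"name": c["name"], "arguments": c["arguments"]}}
                for c in calls.values()
            ],
        })
        for c in calls.values():
            if c["name"] == "get_weather":
                args = json.loads(c["arguments"] or "{}")
                result = get_weather(args.get("location", ""))
            else:
                result = json.dumps({"error": "unknown tool"})
            messages.append({"role": "tool", "tool_call_id": c["id"], "content": result})
    print()
    messages.append({"role": "assistant", "content": reply})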
Conclusion
Building a chatbot on Oxlo.ai means you can focus on user experience and conversation design instead of optimizing token counts to control costs. With request-based pricing, more than 45 models, and full OpenAI SDK compatibility, Oxlo.ai is a practical inference backend for everything from simple support bots to complex agentic assistants. Start with the free tier to prototype, then scale without worrying about ballooning context expenses.











