I built a terminal-based Architecture Explainer to onboard our backend team on how Llama 3, Qwen 3, and DeepSeek V3 actually work under the hood. It is a single Python file that uses Oxlo.ai function calling to generate ASCII diagrams and step-by-step traces of transformer blocks. If you can read Python and want to stop treating LLMs as black boxes, this tool gets you there.
What you'll need
- An Oxlo.ai API key from https://portal.oxlo.ai
- Python 3.10 or newer
- The OpenAI SDK:
pip install openai
Step 1: Bootstrap the client and system prompt
I keep the system prompt in a constant so I can tune tone without touching logic. The prompt instructs the model to state tensor shapes explicitly, use tool calls for diagrams, and admit when it does not know a detail.
SYSTEM_PROMPT = """You are a Staff ML Engineer explaining transformer architecture to a senior backend developer.
Rules:
- Always mention tensor shapes when describing matrices.
- Use tool calls to generate ASCII diagrams.
- Never hand-wave. If you do not know a specific architecture detail, say so."""
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
# Verify connectivity with a minimal request
response = client.chat.completions.create(
model="deepseek-v3.2",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": "Say 'connected' and nothing else."},
],
)
print(response.choices[0].message.content)
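Hardcoding the key works for a first run, but a shared onboarding script is easier to pass around if the key comes from the environment. Here is a minimal sketch, assuming an environment variable named OXLO_API_KEY. That name and the load_api_key helper are my own choices for illustration, not an Oxlo.ai convention.

```python
import os

# OXLO_API_KEY is a hypothetical variable name chosen for this sketch.
ENV_VAR = "OXLO_API_KEY"


def load_api_key(env=None) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {ENV_VAR} before running the explainer.")
    return key


# Usage (assumption: passed to the same client constructor shown above):
# client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key())
```

Failing early with a named variable tends to be friendlier for teammates than an authentication error from the first API call.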
Step 2: Define the tool schema
I give the model three tools so it chooses the right output format instead of dumping unstructured text.
TOOLS = [
{
"type": "function",
"function": {
"name": "draw_architecture",
"description": "Render an ASCII diagram of a transformer block or full model.",
"parameters": {
"type": "object",
"properties": {
"component": {
"type": "string",
"enum": ["attention", "mlp", "full_transformer", "kv_cache"],
},
"model_name": {"type": "string"},
},
"required": ["component"],
},
},
},
{
"type": "function",
"function": {
"name": "explain_math",
"description": "Show the key equation for a component with variable definitions.",
"parameters": {
"type": "object",
"properties": {
"equation_name": {
"type": "string",
"enum": ["softmax_attention", "rope", "rms_norm", "swiglu"],
},
},
"required": ["equation_name"],
},
},
},
{
"type": "function",
"function": {
"name": "trace_forward_pass",
"description": "List the operations in order for a single forward pass step.",
"parameters": {
"type": "object",
"properties": {
"step": {
"type": "string",
"enum": ["embedding", "attention", "mlp", "logits"],
},
},
"required": ["step"],
},
},
},
]
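The enums in the schema steer the model, but a model may still send a value outside them or omit a required field. One way to catch that before calling a handler is a small check against the schema itself. This is a sketch only: validate_args is a hypothetical helper, and SAMPLE_TOOLS is a one-tool excerpt of the list above so the block runs on its own.

```python
def validate_args(tools: list, name: str, args: dict) -> list:
    """Return a list of problems; an empty list means the call matches the schema."""
    schema = next(
        (t["function"]["parameters"] for t in tools if t["function"]["name"] == name),
        None,
    )
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected argument: {key}")
        elif "enum" in props[key] and value not in props[key]["enum"]:
            problems.append(f"{key}={value!r} is not one of {props[key]['enum']}")
    return problems


# A one-tool excerpt of TOOLS so the sketch is self-contained.
SAMPLE_TOOLS = [{
    "type": "function",
    "function": {
        "name": "explain_math",
        "parameters": {
            "type": "object",
            "properties": {
                "equation_name": {
                    "type": "string",
                    "enum": ["softmax_attention", "rope", "rms_norm", "swiglu"],
                },
            },
            "required": ["equation_name"],
        },
    },
}]

print(validate_args(SAMPLE_TOOLS, "explain_math", {"equation_name": "rope"}))  # []
print(validate_args(SAMPLE_TOOLS, "explain_math", {"equation_name": "gelu"}))
```

Returning the problems as strings means they can be sent back to the model as the tool result, which gives it a chance to correct the call.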
Step 3: Implement tool handlers
Each tool is a pure function that returns a string. I keep the diagrams text-based so the agent works in any terminal. Only a few of the enum values are implemented here; the rest return a placeholder string.
def draw_architecture(component: str, model_name: str = "") -> str:
    # model_name is accepted to match the schema; these static diagrams ignore it.
    if component == "kv_cache":
        return """
+-----------+     +-------+      +--------+
|   Input   |---->| Query |      |  Key   |
|  Tokens   |     +-------+      +--------+
+-----------+         |               |
                      v               v
                 +----------+   +----------+
                 | Q @ K^T  |   |  Cached  |
                 +----------+   |   K, V   |
                      |         +----------+
                      v               |
                 +----------+         |
                 | Softmax  |<--------+
                 +----------+
                      |
                      v
                 +----------+
                 |   @ V    |
                 +----------+
"""
    if component == "attention":
        return """
Input: X (batch, seq, dim)
     |
     v
+---------+     +---------+
|   W_q   |     |   W_k   |
+---------+     +---------+
     |               |
     v               v
  Q, K, V            +----> RoPE
     |               |
     v               v
  Q @ K^T ----> Mask ----> Softmax ----> @ V
"""
    return f"Diagram for {component} not yet implemented."


def explain_math(equation_name: str) -> str:
    if equation_name == "softmax_attention":
        return "Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V"
    if equation_name == "rms_norm":
        # g is the learned per-channel gain applied after normalization.
        return "RMSNorm(x) = g * x / sqrt(mean(x^2) + epsilon)"
    return "Equation not in database."


def trace_forward_pass(step: str) -> str:
    if step == "attention":
        return "1. Project to Q,K,V. 2. Apply RoPE to Q and K. 3. Compute attention scores. 4. Softmax. 5. Weighted sum of V."
    return "Trace not implemented."
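The handlers return formulas as strings, so it can help to see one of them evaluated. Here is a sketch of RMS normalization in plain Python, assuming the common formulation with epsilon inside the square root and an optional learned gain, called weight here. The function name and defaults are illustrative and not taken from any particular model's code.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Scale a vector by the inverse of its root mean square, then apply a gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(x)  # identity gain, for illustration only
    return [v * scale * w for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0])
# After normalization the mean of squares is approximately 1.
print(sum(v * v for v in y) / len(y))
```

Unlike LayerNorm, nothing is subtracted from the input: the vector is only rescaled, which is why the formula has no mean term in the numerator.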
Step 4: Build the conversation loop
I maintain a messages list and let the model decide when to call a tool. This keeps the interaction stateful without extra frameworks.
import json


def run_explainer():
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Explain how KV caching works in Llama 3.3 70B."},
    ]
    while True:
        response = client.chat.completions.create(
            model="qwen-3-32b",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )
        message = response.choices[0].message
        if message.content:
            print("Assistant:", message.content)
        # No tool calls means the model has finished its answer.
        if not message.tool_calls:
            break
        # Echo the assistant turn back so each tool result can reference its call id.
        messages.append({
            "role": "assistant",
            "content": message.content or "",
            "tool_calls": [
                {
                    "id": tc.id,
                    "type": "function",
                    "function": {"name": tc.function.name, "arguments": tc.function.arguments},
                }
                for tc in message.tool_calls
            ],
        })
        for tc in message.tool_calls:
            fn_name = tc.function.name
            # Arguments arrive as a JSON string, not a dict.
            args = json.loads(tc.function.arguments)
            if fn_name == "draw_architecture":
                result = draw_architecture(**args)
            elif fn_name == "explain_math":
                result = explain_math(**args)
            elif fn_name == "trace_forward_pass":
                result = trace_forward_pass(**args)
            else:
                result = "Unknown tool."
            print(f"Tool [{fn_name}]:", result)
            # One tool message per call, matched to the call by id.
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })


if __name__ == "__main__":
    run_explainer()
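The loop above trusts the model to send well-formed JSON and the exact keyword arguments each handler expects. A dispatch table with a guard is one way to make that step more forgiving. This is a sketch under assumptions: HANDLERS and dispatch are hypothetical names, and the explain_math stub stands in for the real handler so the block runs on its own.

```python
import json


def explain_math(equation_name: str) -> str:
    # Stand-in for the real handler so this sketch is self-contained.
    return f"equation: {equation_name}"


# Map tool names to handlers; adding a tool means adding one entry.
HANDLERS = {"explain_math": explain_math}


def dispatch(name: str, raw_arguments: str) -> str:
    """Run one tool call and always return a string the model can read."""
    handler = HANDLERS.get(name)
    if handler is None:
        return "Unknown tool."
    try:
        args = json.loads(raw_arguments or "{}")
        return handler(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or unexpected keyword arguments: report instead of crashing.
        return f"Bad arguments for {name}: {exc}"


print(dispatch("explain_math", '{"equation_name": "rope"}'))
print(dispatch("explain_math", "{not json"))
```

Because the error comes back as an ordinary tool result, the conversation continues and the model can retry with corrected arguments.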
Step 5: Add complexity levels
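I inject a complexity marker into the system prompt so the same agent can explain RoPE to either a bootcamp grad or a CUDA kernel engineer. Call set_complexity before run_explainer, because the loop reads SYSTEM_PROMPT when it builds its first message.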
COMPLEXITY = {
"beginner": "Explain using analogies. Avoid matrix math. Use food or workflow metaphors.",
"intermediate": "Use tensor shapes and PyTorch-style pseudocode. No CUDA details.",
"advanced": "Include stride patterns, kernel fusion opportunities, and numerical stability concerns.",
}
def set_complexity(level: str):
    """Rebuild the global SYSTEM_PROMPT for the given audience level."""
    global SYSTEM_PROMPT
    base = """You are a Staff ML Engineer explaining transformer architecture to a senior backend developer.
Rules:
- Always mention tensor shapes when describing matrices.
- Use tool calls to generate ASCII diagrams.
- Never hand-wave. If you do not know a specific architecture detail, say so."""
    # Unknown levels fall back to "intermediate".
    suffix = COMPLEXITY.get(level, COMPLEXITY["intermediate"])
    SYSTEM_PROMPT = base + "\nAudience level:\n" + suffix
# Example usage before running the explainer
set_complexity("intermediate")
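If mutating a global feels fragile, the same idea can be written as a pure function that returns the prompt. This is a sketch of that alternative, not a replacement for the version above: build_system_prompt, BASE_PROMPT, and LEVELS are hypothetical names, and the prompt strings are shortened to keep the example small. It also raises on an unknown level instead of silently falling back.

```python
# Shortened stand-in for the full base prompt, to keep the sketch small.
BASE_PROMPT = "You are a Staff ML Engineer explaining transformer architecture."

LEVELS = {
    "beginner": "Explain using analogies. Avoid matrix math.",
    "intermediate": "Use tensor shapes and PyTorch-style pseudocode.",
    "advanced": "Include stride patterns and numerical stability concerns.",
}


def build_system_prompt(level: str) -> str:
    """Return a fresh prompt string instead of mutating a global."""
    if level not in LEVELS:
        raise ValueError(f"Unknown level {level!r}; expected one of {sorted(LEVELS)}")
    return f"{BASE_PROMPT}\nAudience level:\n{LEVELS[level]}"


print(build_system_prompt("advanced"))
```

With this shape, the conversation loop would take the prompt as a parameter, which makes it easier to run two audience levels side by side.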
Run it
Save the script as explainer.py, replace YOUR_OXLO_API_KEY with your own key, and run:
python explainer.py
Example output:
Assistant: I will break down KV caching for Llama 3.3 70B step by step.
Tool [draw_architecture]:
+-----------+     +-------+      +--------+
|   Input   |---->| Query |      |  Key   |
|  Tokens   |     +-------+      +--------+
+-----------+         |               |
                      v               v
                 +----------+   +----------+
                 | Q @ K^T  |   |  Cached  |
                 +----------+   |   K, V   |
                      |         +----------+
                      v               |
                 +----------+         |
                 | Softmax  |<--------+
                 +----------+
                      |
                      v
                 +----------+
                 |   @ V    |
                 +----------+
Assistant: During generation, we compute Query, Key, and Value vectors only for the latest token and reuse the Key and Value tensors stored from previous steps. This cuts the attention work for each new token from quadratic to linear in sequence length.
Tool [explain_math]: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V
Assistant: That is the core equation. Because K and V are cached, the QK^T matrix is shape (1, seq_len) instead of (seq_len, seq_len) for each new token.
The agent stops when it has nothing left to look up. Because Oxlo.ai uses flat per-request pricing, iterating on long system prompts and multi-turn tool chains does not inflate your bill the way token-based pricing would. See https://oxlo.ai/pricing for details.
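The caching claim in the example output can be checked with a toy decode loop. The sketch below is single-head attention in plain Python with no learned projections, so each token vector is reused as its own query, key, and value. That shortcut and the KVCache class are assumptions made for illustration. The point is only that each step computes one row of scores whose length equals the number of cached tokens.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class KVCache:
    """Toy single-head cache: one key and one value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def decode_step(self, q, k, v):
        # Append this token's key and value, then attend over everything cached.
        self.keys.append(k)
        self.values.append(v)
        d_k = len(q)
        scores = [dot(q, key) / math.sqrt(d_k) for key in self.keys]  # one row: (1, seq_len)
        weights = softmax(scores)
        out = [sum(w * val[i] for w, val in zip(weights, self.values)) for i in range(len(v))]
        return out, len(scores)


cache = KVCache()
for token in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    # Toy assumption: the token vector doubles as q, k, and v.
    out, n_scores = cache.decode_step(token, token, token)
    print(n_scores)  # grows by one per generated token: 1, 2, 3
```

Without the cache, every step would recompute keys and values for the whole prefix. With it, per-token work grows linearly with the prefix length, although the total over a full generation is still quadratic.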
Wrap-up
Two concrete next steps. First, expose the complexity flag as a CLI argument so your teammates can run python explainer.py --level advanced --model llama-3.3-70b. Second, cache the tool responses in a local SQLite database so repeated questions about KV caching hit disk instead of the API.
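Both next steps fit in the standard library. The sketch below pairs an argparse parser for the --level and --model flags with a small SQLite-backed cache. The ResponseCache class, the cache_key helper, the table name, and the default file name are all hypothetical, and what you use as the key (question, level, model) is a design choice left open here.

```python
import argparse
import json
import sqlite3


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Architecture Explainer")
    parser.add_argument(
        "--level",
        choices=["beginner", "intermediate", "advanced"],
        default="intermediate",
    )
    parser.add_argument("--model", default="qwen-3-32b")
    return parser.parse_args(argv)


def cache_key(question: str, level: str, model: str) -> str:
    # Sorted JSON gives a stable key regardless of argument order.
    return json.dumps({"q": question, "level": level, "model": model}, sort_keys=True)


class ResponseCache:
    """Tiny key/value store backed by SQLite."""

    def __init__(self, path="explainer_cache.db"):  # hypothetical file name
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT result FROM responses WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def put(self, key, result):
        self.conn.execute(
            "INSERT OR REPLACE INTO responses (key, result) VALUES (?, ?)", (key, result)
        )
        self.conn.commit()


args = parse_args(["--level", "advanced", "--model", "llama-3.3-70b"])
cache = ResponseCache(":memory:")  # in-memory database for the demo
key = cache_key("How does KV caching work?", args.level, args.model)
if cache.get(key) is None:
    cache.put(key, "cached explanation text")
print(cache.get(key))
```

Including the level and model in the key keeps a beginner answer from being served to someone who asked for the advanced one.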







