How Did You Last Validate Your Skill?
You finished writing a skill, triggered it manually a couple of times, the output looked reasonable — and then you shipped it.
That's probably the full validation workflow for most people. Slightly embarrassing to admit, but true. We write unit tests and run CI for regular code. But when it comes to Skills, we somehow regress to the era of "going by feel."
The problem isn't laziness. The problem is we don't have a clear picture of how Skills quietly fail — and we don't have a shared vocabulary for what "good" even means.
This article addresses both. First, we'll map out the failure paths. Then we'll use that failure map to reverse-engineer a validation system.
How Do Skills Fail?
Before talking about how to test, let's think through what can go wrong. Skill failures typically follow four paths — and they tend to be quiet. No loud errors, just results that are "a bit off."
Path 1: The Skill Never Triggered
This is the most invisible failure. The user said "format my code," but the Agent never invoked your code-formatting Skill — it just used its own knowledge to make some changes that seemed reasonable.
The root cause is usually a vague description field in SKILL.md. Too broad and it conflicts with other Skills; too narrow and it misses a wide range of legitimate triggers. The tricky part: this failure is nearly impossible to catch in manual testing, because you test using the most canonical trigger phrases, while real users express the same intent in a thousand different ways.
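For instance, here is a hypothetical SKILL.md frontmatter (the skill name and all three descriptions are invented for illustration) showing how a vague description compares to a sharper one:

```markdown
---
name: code-formatter
# Too broad — collides with anything code-related:
#   description: Helps with code.
# Too narrow — only matches one canonical phrasing:
#   description: Formats TypeScript files when the user says "format my code".
# Better — names the action, the inputs, and several realistic phrasings:
description: >
  Formats source files to match the project's style rules. Use when the
  user asks to format, clean up, reindent, or lint-fix existing code.
---
```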
Path 2: Triggered, But the Task Wasn't Completed
The Skill was invoked, tools were called, but the job didn't get done. Maybe three files were supposed to be created and only two were. Maybe a migration script started but exited early.
This is Outcome failure — the most direct and impactful type. Users see the result. If the result is wrong, the Skill might as well not exist.
Path 3: The Right Result, the Wrong Path
This one is subtler. The final output looks fine, but the execution path was wrong: the wrong tools were called, the steps happened out of order, or the Agent took a long detour to get there.
Example: a database migration Skill where the correct sequence is "backup → migrate → verify." If the Agent migrates first and backs up second, the output files might look identical — but the next time a migration fails, you have no usable backup. This is Process failure, completely invisible to result-only validation.
Path 4: Completed, But Below Quality Bar
The task finished. The process was correct. But: the generated code doesn't match the project's style conventions. The commit message format doesn't follow your team's standards. The task used 500 tokens when 100 would have sufficed.
This is Style and Efficiency failure. It won't throw an error. It accumulates silently as technical debt, team friction, and rising costs.
Defining "Success": Four Validation Dimensions
These four failure paths map directly to four success criteria. Until you've defined all four, you haven't really specified what the Skill is supposed to do.
| Dimension | Corresponding Failure | Core Question |
|---|---|---|
| Outcome | Task not completed | Did it do what it was supposed to do? |
| Process | Wrong execution path | Were the right tools used in the right order? |
| Style | Quality below bar | Does the output conform to conventions? |
| Efficiency | Wasted resources | Any unnecessary detours? Reasonable token usage? |
Here's how to validate each one.
Validating Outcome: Deterministic Checks
Outcome is the most quantifiable dimension — best validated with deterministic graders: parse the run log or inspect filesystem state to confirm whether the task completed.
Build a Small Test Set First
You don't need hundreds of test cases. Ten to twenty is enough — but they need to cover three types:
- Explicit trigger: "/use code-formatter please format this file"
- Implicit trigger: "this code looks messy, can you clean it up?"
- Negative control: "write me a sorting algorithm" (should NOT trigger the formatter)
Negative controls are especially important. They check whether the Skill is being triggered when it shouldn't be — over-triggering is just as much a problem as never triggering.
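A minimal sketch of such a test set as plain data (the field names and the `check_trigger` helper are my own, not from any framework):

```python
# Each case: the prompt sent to the Agent, and whether the Skill should fire.
TEST_CASES = [
    # Explicit trigger
    {"prompt": "/use code-formatter please format this file", "should_trigger": True},
    # Implicit trigger
    {"prompt": "this code looks messy, can you clean it up?", "should_trigger": True},
    # Negative control
    {"prompt": "write me a sorting algorithm", "should_trigger": False},
]

def check_trigger(case, triggered_skills):
    """Compare one observed run against the expectation for one test case."""
    fired = "code-formatter" in triggered_skills
    return fired == case["should_trigger"]
```

Running every case through the Agent and feeding the observed skill invocations into a check like this catches both under- and over-triggering.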
Deterministic Assertions on JSON Output
`codex exec --json` produces a structured JSONL run log containing details of every tool call. For Outcome validation:
```python
import json
import sys

# Load the JSONL run log produced by `codex exec --json`
with open("run_output.jsonl") as f:
    events = [json.loads(line) for line in f]

# Check whether the target files were created
created_files = [
    e["path"] for e in events
    if e.get("type") == "file_write"
]

expected = ["src/index.ts", "src/types.ts", "README.md"]
missing = [f for f in expected if f not in created_files]

if missing:
    print(f"❌ FAIL: The following files were not created: {missing}")
    sys.exit(1)
else:
    print("✅ PASS: All expected files were created")
```
The advantage here: these checks are deterministic — no model judgment involved, results are stable and reproducible. Keep Outcome-level Evals this way. Save the subjective assessment for Rubric scoring later.
Validating Process: Tool Call Sequence Verification
Outcome checks tell you "was the job done?" Process checks tell you "how was it done?"
For Skills with defined execution ordering requirements, you need to verify the sequence of tool calls:
```python
import sys

# Extract all tool calls, in order, from the same `events` list
# parsed from the JSONL run log in the previous snippet
tool_calls = [
    e["tool"] for e in events
    if e.get("type") == "tool_use"
]

# Define the expected call sequence
expected_sequence = ["db_backup", "db_migrate", "db_verify"]

# Check whether it appears as a subsequence (other tools allowed in between)
def is_subsequence(expected, actual):
    it = iter(actual)
    return all(step in it for step in expected)

if not is_subsequence(expected_sequence, tool_calls):
    print("❌ FAIL: Tool call sequence doesn't match the expected order")
    print(f"   Expected to contain: {expected_sequence}")
    print(f"   Actual calls: {tool_calls}")
    sys.exit(1)
else:
    print("✅ PASS: Tool call sequence is correct")
```
Process validation has a more advanced use: detecting command thrashing — the Agent repeatedly retrying the same operation, bouncing back and forth. This usually signals that the Skill's instructions are ambiguous enough that the Agent is flailing. Detect it by counting consecutive repeated calls:
```python
from itertools import groupby

# Count runs of consecutive identical tool calls
for tool, group in groupby(tool_calls):
    count = sum(1 for _ in group)
    if count > 3:
        print(f"⚠️ WARNING: '{tool}' called {count} times consecutively — possible thrashing")
```
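Note that `groupby` only catches back-to-back repeats. If the Agent alternates between two operations (A, B, A, B, ...), every run of identical calls has length 1 and slips through. A total-count check per tool catches that pattern too (the threshold of 2 here is arbitrary; tune it per Skill):

```python
from collections import Counter

# Example call log showing alternation rather than consecutive repeats
tool_calls = ["db_backup", "db_migrate", "db_backup", "db_migrate",
              "db_backup", "db_migrate", "db_verify"]

# Flag any tool called far more often than a healthy run should need
counts = Counter(tool_calls)
suspicious = {tool: n for tool, n in counts.items() if n > 2}
for tool, n in suspicious.items():
    print(f"⚠️ WARNING: '{tool}' called {n} times in total — possible alternating thrashing")
```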
Validating Style and Efficiency: Rubric-Based Model Scoring
Outcome and Process are factual checks — pass or fail, black or white. Style and Efficiency are qualitative judgments: is the code style right? Is the commit message format correct? Did the Agent take unnecessary detours? These don't have a single right answer.
This is where you switch tools: let another model score the output, but give it a clear rubric and use `--output-schema` to force structured JSON responses, making scores comparable across runs.
Define Your Rubric
```yaml
# rubric.yaml
criteria:
  - name: code_style
    description: "Does the generated code conform to the project's ESLint rules?"
    scale: [1, 5]
    anchor_1: "Totally non-conformant, many violations"
    anchor_5: "Fully conformant, zero violations"
  - name: commit_format
    description: "Does the commit message follow the Conventional Commits specification?"
    scale: [1, 5]
    anchor_1: "Format completely wrong"
    anchor_5: "Fully correct — type, scope, and description all proper"
  - name: efficiency
    description: "Did the Agent take obvious redundant steps or make unnecessary tool calls?"
    scale: [1, 5]
    anchor_1: "Lots of redundancy, chaotic execution"
    anchor_5: "Clean and efficient execution path"
```
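The grader model needs the rubric rendered into its instructions. One dependency-free way is to keep the criteria as a Python structure and build the grading prompt from it (a sketch; the prompt wording and `build_grading_prompt` helper are my own):

```python
# Criteria mirroring rubric.yaml, kept as plain Python data
CRITERIA = [
    {"name": "code_style",
     "description": "Does the generated code conform to the project's ESLint rules?",
     "anchor_1": "Totally non-conformant, many violations",
     "anchor_5": "Fully conformant, zero violations"},
    {"name": "efficiency",
     "description": "Did the Agent take obvious redundant steps?",
     "anchor_1": "Lots of redundancy, chaotic execution",
     "anchor_5": "Clean and efficient execution path"},
]

def build_grading_prompt(criteria, run_output):
    """Render the rubric and the run output into one grading instruction."""
    lines = ["Score the run below on each criterion from 1 to 5.", ""]
    for c in criteria:
        lines.append(f"- {c['name']}: {c['description']}")
        lines.append(f"  1 = {c['anchor_1']}; 5 = {c['anchor_5']}")
    lines += ["", "Run output:", run_output]
    return "\n".join(lines)
```

Anchoring both ends of the scale in the prompt is what keeps scores stable across grader runs.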
Force Structured Output with `--output-schema`
```python
import json
import subprocess
import tempfile

output_schema = {
    "type": "object",
    "properties": {
        "code_style": {"type": "integer", "minimum": 1, "maximum": 5},
        "commit_format": {"type": "integer", "minimum": 1, "maximum": 5},
        "efficiency": {"type": "integer", "minimum": 1, "maximum": 5},
        "reasoning": {"type": "string"}
    },
    "required": ["code_style", "commit_format", "efficiency", "reasoning"]
}

# --output-schema expects a path to a JSON Schema file, not inline JSON
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(output_schema, f)
    schema_path = f.name

result = subprocess.run(
    ["codex", "exec", "--output-schema", schema_path,
     "Evaluate the following run output for code style compliance..."],
    capture_output=True, text=True
)

scores = json.loads(result.stdout)
print(f"Code style: {scores['code_style']}/5")
print(f"Commit format: {scores['commit_format']}/5")
print(f"Efficiency: {scores['efficiency']}/5")
print(f"Reasoning: {scores['reasoning']}")
```
The core value of structured output: cross-version comparability. You adjust a line in the Skill's instructions, re-run the Eval, and the Style score goes from 3.2 to 4.1. That's a trustworthy improvement signal — not "it feels better now."
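Once scores are structured, comparing two Skill versions is a few lines. A sketch (the 0.3 tolerance is an arbitrary choice, and `find_regressions` is my own helper):

```python
def find_regressions(baseline, candidate, tolerance=0.3):
    """Return criteria whose mean score dropped by more than `tolerance`."""
    regressions = {}
    for criterion, old in baseline.items():
        new = candidate.get(criterion, 0.0)
        if old - new > tolerance:
            regressions[criterion] = (old, new)
    return regressions

# Mean Rubric scores aggregated over the test set, per Skill version
baseline  = {"code_style": 4.1, "commit_format": 4.5, "efficiency": 3.8}
candidate = {"code_style": 3.2, "commit_format": 4.6, "efficiency": 3.7}

print(find_regressions(baseline, candidate))  # → {'code_style': (4.1, 3.2)}
```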
Progressive Stacking: Let Your Evals Grow with Your Skill
You don't need to build all four dimensions at once. Skill validation should iterate alongside the Skill itself.
- Phase 1 (Skill just written): run manual tests; confirm the basic Outcome looks right.
- Phase 2 (ready to share): add deterministic Outcome checks; build 10 test cases.
- Phase 3 (team is using it): add Process sequence validation and Style Rubric scoring.
- Phase 4 (production-critical path): add command-thrashing detection, token usage monitoring, build validation, and runtime smoke tests.
This progressive approach has one key benefit: you build the Eval habit when the Skill is simplest, without getting blocked waiting to build the full system. A Skill with two Outcome checks is meaningfully safer than one with none.
As your Eval suite matures, wire it into your CI pipeline so every Skill change triggers an automatic run. At that point, you're iterating on Skills with real confidence — not gambling on "this should be fine."
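As a sketch of what that wiring might look like in GitHub Actions (the workflow name, directory layout, and script paths are all hypothetical):

```yaml
# .github/workflows/skill-evals.yml
name: skill-evals
on:
  pull_request:
    paths:
      - "skills/**"        # re-run Evals whenever a Skill changes
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r evals/requirements.txt
      - run: python evals/run_outcome_checks.py   # deterministic graders
      - run: python evals/run_rubric_scoring.py   # model-based Rubric scoring
```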
Summary
Back to the opening question: is your Skill actually good?
Now we have a framework to answer it:
| You want to know… | Use this |
|---|---|
| Was the task completed? | Deterministic checks (parse JSONL, verify files/state) |
| Were the steps correct? | Tool call sequence validation + thrashing detection |
| Was the quality good? | Rubric model scoring (structured JSON output) |
| Was anything wasted? | Token usage tracking + redundant step detection |
Good Evals do two things: make regressions clear — you know exactly what change caused the score to drop; and make failures explainable — not "something feels off" but "the tool call sequence was wrong at step 3."
That's what gives you the confidence to keep improving your Skills without second-guessing every change.
Source: Core methodology from the OpenAI developer blog — Testing Agent Skills Systematically with Evals.