I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.
As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that increased test coverage by 300%.
So when I started working with AI agent skills, I noticed something: nobody was testing them.
You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.
There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.
That's a QA problem. I built opencode-skill-creator to solve it.
Then I dogfooded it on a real project. Here's what happened.
The Project: AdLoop Skills for Google Ads
AdLoop is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.
I created 4 skills for AdLoop using opencode-skill-creator:
- adloop-planning — Keyword research, competition analysis, and budget forecasting
- adloop-read — Performance analysis, campaign reporting, and conversion diagnostics
- adloop-write — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)
- adloop-tracking — GA4 event validation, conversion tracking diagnosis, and code generation
The Benchmark Results
opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:
- With skill loaded — the AI has full domain knowledge, safety rules, and orchestration patterns
- Without skill — the AI only has bare MCP tool names and descriptions
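The A/B setup above can be sketched in a few lines. This is a hypothetical harness, not opencode-skill-creator's actual API — the `EvalCase` shape and `runQuery` callback are assumptions for illustration:

```typescript
// Hypothetical with/without-skill eval harness (illustrative only;
// EvalCase and runQuery are NOT opencode-skill-creator's real API).

interface EvalCase {
  query: string;                          // the eval prompt
  expect: (answer: string) => boolean;    // pass/fail check on the response
}

interface BenchmarkResult {
  withSkill: number;     // pass rate, 0..1
  withoutSkill: number;  // pass rate, 0..1
}

async function benchmark(
  cases: EvalCase[],
  runQuery: (query: string, skillLoaded: boolean) => Promise<string>,
): Promise<BenchmarkResult> {
  // Run every eval twice: once with the skill loaded, once with bare tools.
  const passRate = async (skillLoaded: boolean): Promise<number> => {
    let passed = 0;
    for (const c of cases) {
      const answer = await runQuery(c.query, skillLoaded);
      if (c.expect(answer)) passed++;
    }
    return passed / cases.length;
  };
  return {
    withSkill: await passRate(true),
    withoutSkill: await passRate(false),
  };
}
```

The only variable between the two runs is `skillLoaded` — same model, same tools, same queries — which is what makes the pass-rate delta attributable to the skill.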
| Skill | Evals | With Skill | Without Skill | Improvement |
|---|---|---|---|---|
| adloop-write | 8 | 100% | 17% | +83pp |
| adloop-planning | 6 | 100% | 21% | +79pp |
| adloop-read | 8 | 100% | 27% | +73pp |
| adloop-tracking | 6 | 100% | 33% | +67pp |
But the raw numbers only tell part of the story. The failures without skills aren't just wrong answers — they're dangerous actions.
The Scariest Failure: Real Money at Stake
adloop-write manages campaigns, ads, keywords, and budgets — operations that spend real money. Without the skill, the AI:
- Added BROAD match keywords to MANUAL_CPC campaigns — the #1 cause of wasted ad spend
- Set budget above safety caps ($100 when max is $50) — no guardrail
- Deleted campaigns irreversibly without warning — no confirmation, no pause alternative
- Batched multiple changes in one call — bypassing review steps
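Each of those failures is a rule a skill can state and a validator can enforce. Here's a minimal sketch of such a guardrail layer — the cap ($50) and the broad-match rule come from the failures above, but the function names and types are illustrative, not AdLoop's code:

```typescript
// Hypothetical guardrail checks for write operations. The $50 cap and
// the BROAD-on-MANUAL_CPC rule mirror the article's examples; the
// types and function names are illustrative, not AdLoop's actual code.

type MatchType = "EXACT" | "PHRASE" | "BROAD";
type BiddingStrategy = "MANUAL_CPC" | "SMART_BIDDING";

const MAX_DAILY_BUDGET_USD = 50; // safety cap stated in the skill

function validateKeywordAdd(match: MatchType, strategy: BiddingStrategy): string[] {
  const errors: string[] = [];
  if (match === "BROAD" && strategy === "MANUAL_CPC") {
    errors.push(
      "BROAD match on a MANUAL_CPC campaign wastes spend; " +
        "use PHRASE/EXACT or enable Smart Bidding first.",
    );
  }
  return errors;
}

function validateBudgetChange(newDailyBudgetUsd: number): string[] {
  return newDailyBudgetUsd > MAX_DAILY_BUDGET_USD
    ? [
        `Daily budget $${newDailyBudgetUsd} exceeds the $${MAX_DAILY_BUDGET_USD} ` +
          "safety cap; require explicit user confirmation.",
      ]
    : [];
}
```

With checks like these in the loop, setting a $100 budget or adding a BROAD keyword to a MANUAL_CPC campaign fails validation instead of silently spending money.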
This isn't about "better answers." This is about preventing real financial harm.
GDPR ≠ Broken Tracking
A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"
Without the skill, the AI diagnosed this as a tracking issue and offered to investigate.
With the skill, the AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."
The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.
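The skill's rule reduces to a simple ratio check. The thresholds below (roughly 2:1 to 4:1 as "normal" in consent-banner markets) are assumptions for illustration, not AdLoop's exact numbers:

```typescript
// Hypothetical encoding of the skill's domain rule: a click-to-session
// gap is EXPECTED under GDPR consent banners, not proof of broken
// tracking. Thresholds are illustrative assumptions.

function diagnoseClickSessionGap(adClicks: number, ga4Sessions: number): string {
  const ratio = adClicks / ga4Sessions;
  if (ratio >= 2 && ratio <= 4) {
    // Google Ads counts every click; GA4 only counts consenting users.
    return `A ${ratio.toFixed(1)}:1 ratio is normal with GDPR consent banners; tracking is likely fine.`;
  }
  if (ratio > 4) {
    return `A ${ratio.toFixed(1)}:1 ratio is unusually high; investigate the GA4 tag.`;
  }
  return `A ${ratio.toFixed(1)}:1 ratio looks healthy.`;
}
```

For the scenario above, `diagnoseClickSessionGap(500, 180)` reports a 2.8:1 ratio as normal — the same answer the skill-loaded AI gave.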
Don't Trust Google Blindly
Without the skill, the AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.
The skill explicitly states: "Google recommendations optimize for Google's revenue, not yours." It cross-references against conversion data first. The 73% improvement comes from teaching critical thinking, not compliance.
Why This Matters
The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.
Skills do three things bare tool access doesn't:
- Inject domain expertise — GDPR mechanics, budget rules, competition levels
- Enforce safety guardrails — budget caps, deletion warnings, one-change-at-a-time
- Provide orchestration patterns — when to call which tool, in what order, with what validation
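All three of those things live in one markdown file. A hypothetical, abridged SKILL.md in the spirit of adloop-write might look like this — the structure and wording are illustrative, not the actual skill:

```markdown
---
name: adloop-write
description: Create and manage Google Ads campaigns, ads, keywords, and
  budgets. Use for any change that spends real money.
---

# Safety rules
- Never set a daily budget above $50 without explicit user confirmation.
- Never add BROAD match keywords to MANUAL_CPC campaigns.
- Prefer pausing a campaign over deleting it; deletion is irreversible.
- Make one change per call so each step can be reviewed.

# Orchestration
1. Read the current campaign state before writing anything.
2. Validate the proposed change against the safety rules.
3. Apply the change, then re-read to confirm it took effect.
```

The `description` is what the agent sees when deciding whether to trigger the skill — which is exactly why a description change needs regression tests.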
Try It Yourself
npx opencode-skill-creator install --global
Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.
→ github.com/antongulin/opencode-skill-creator
Skills are software. Software should be tested.
Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at anton.qa or on LinkedIn.