I Found 54 Reliability Issues in My 14-Agent AI System — Here's What Broke

Every testing tool for AI agents tests individual agents. But production failures don't happen inside agents — they happen between them.

I learned this the hard way.

The Problem Nobody Is Solving

I built a 14-agent document processing system using CrewAI. Each agent worked perfectly in isolation. In production, the system failed constantly — and I couldn't figure out why.

The problem wasn't any single agent. It was the interactions:

One agent failing silently took down 12 others
Agents were sharing sensitive data across boundaries they shouldn't cross
Three agents formed a communication clique that bypassed the orchestrator
Every agent depended on one central orchestrator with zero fallback

No existing tool could find these issues. Arize, Langfuse, Braintrust — they all monitor individual agents. None of them test the graph of agent interactions.

So I built one.

What I Built: swarm-test

swarm-test builds a NetworkX interaction graph of your multi-agent system and runs 6 chaos engineering tests against it:

Cascade Failure — which agents bring down the whole system if they fail
Context Leakage — sensitive data (API keys, PII, credentials) crossing agent boundaries
Intent Drift — agents acting outside their role or being manipulated
Collusion Detection — agents communicating outside the orchestrator's oversight
Blast Radius — single points of failure and critical dependency paths
Timeout Resilience — agents with no fallback if upstream is slow

3-line API:

from swarm_test import SwarmProbe

probe = SwarmProbe(crew)
report = probe.run_all()
report.print_summary()

What It Found On My Real System

I ran swarm-test on my 14-agent system. The results were brutal:

54 total findings:

15 CRITICAL (14 cascade failures + 1 SPOF)
13 HIGH (9 timeout vulnerabilities + 4 collusion cliques)
26 MEDIUM (13 intent drift + 13 missing timeout handling)

The worst agent: OrchestratorAgent scored 4 out of 100. It's a single point of failure with 92% blast radius — if it fails, 12 of 14 agents go down. And it had zero timeout handling.

The scariest finding: EvolutionAgent has 100% blast radius. If it fails, every other agent in the system is affected.

Three agents (OrchestratorAgent, FileOptimizerAgent, PrintOptimizerAgent) formed a collusion clique — communicating directly with each other and bypassing orchestrator oversight.

None of this was visible from testing individual agents. It only appeared when I tested the interaction graph.

I Shipped 7 Features in 7 Days

After launching, I shipped one feature every day:

Day	Feature	Impact
0	Launch — 5 chaos tests, GitHub + PyPI	First multi-agent testing tool on PyPI
1	Timeout resilience test	Found 22 new issues in my system
2	JSON export	Another developer integrated it into his runtime gate within hours
3	LangGraph adapter	Now supports CrewAI + LangGraph
4	Sensitive data detection (23 patterns)	Catches AWS keys, JWT tokens, credit cards crossing agent boundaries
5	Per-agent health scores (0-100)	Know exactly which agent to fix first
6	Before/after comparison	Measure if your refactor actually improved reliability
7	ASCII agent graph	See your agent topology right in the terminal

94 tests passing. Two frameworks supported. And growing.

The First Integration

Within 48 hours of launch, another developer built an integration. He has a runtime action-gate that blocks dangerous agent actions before execution. He connected swarm-test's findings as "priors" — so when swarm-test flags an edge as high-risk, his gate becomes more cautious on that edge.

The result: the same run_sql action went from "CONFIRM" (risk 62) to "HUMAN_REQUIRED" (risk 78) when swarm-test's cascade finding was attached.

Structural testing (swarm-test) + runtime enforcement (his gate) = the full reliability stack for multi-agent systems.

Why This Matters Now

According to recent industry research:

88% of organizations report AI agent security incidents
Only 14.4% of agents go live with full security approval
OWASP classified cascade failures as ASI08 — a top AI security risk

Multi-agent systems are going to production faster than anyone can secure them. The tools exist for single-agent monitoring. Nothing existed for multi-agent interaction testing — until now.

Try It

pip install swarm-test

from swarm_test import SwarmProbe

# Works with CrewAI
probe = SwarmProbe(your_crew)
report = probe.run_all()
report.print_summary()
report.to_html("report.html")  # Interactive D3 graph
report.to_json("report.json")  # Machine-readable for CI/CD