Every testing tool for AI agents tests individual agents. But production failures don't happen inside agents — they happen between them.
I learned this the hard way.
The Problem Nobody Is Solving
I built a 14-agent document processing system using CrewAI. Each agent worked perfectly in isolation. In production, the system failed constantly — and I couldn't figure out why.
The problem wasn't any single agent. It was the interactions:
- One agent failing silently took down 12 others
- Agents were sharing sensitive data across boundaries they shouldn't cross
- Three agents formed a communication clique that bypassed the orchestrator
- Every agent depended on one central orchestrator with zero fallback
No existing tool could find these issues. Arize, Langfuse, Braintrust — they all monitor individual agents. None of them test the graph of agent interactions.
So I built one.
What I Built: swarm-test
swarm-test builds a NetworkX interaction graph of your multi-agent system and runs 6 chaos engineering tests against it:
- Cascade Failure — which agents bring down the whole system if they fail
- Context Leakage — sensitive data (API keys, PII, credentials) crossing agent boundaries
- Intent Drift — agents acting outside their role or being manipulated
- Collusion Detection — agents communicating outside the orchestrator's oversight
- Blast Radius — single points of failure and critical dependency paths
- Timeout Resilience — agents with no fallback if upstream is slow
3-line API:
from swarm_test import SwarmProbe
probe = SwarmProbe(crew)
report = probe.run_all()
report.print_summary()
What It Found On My Real System
I ran swarm-test on my 14-agent system. The results were brutal:
54 total findings:
- 15 CRITICAL (14 cascade failures + 1 SPOF)
- 13 HIGH (9 timeout vulnerabilities + 4 collusion cliques)
- 26 MEDIUM (13 intent drift + 13 missing timeout handling)
The worst agent: OrchestratorAgent scored 4 out of 100. It's a single point of failure with 92% blast radius — if it fails, 12 of 14 agents go down. And it had zero timeout handling.
The scariest finding: EvolutionAgent has 100% blast radius. If it fails, every other agent in the system is affected.
Three agents (OrchestratorAgent, FileOptimizerAgent, PrintOptimizerAgent) formed a collusion clique — communicating directly with each other and bypassing orchestrator oversight.
None of this was visible from testing individual agents. It only appeared when I tested the interaction graph.
I Shipped 7 Features in 7 Days
After launching, I shipped one feature every day:
| Day | Feature | Impact |
|---|---|---|
| 0 | Launch — 5 chaos tests, GitHub + PyPI | First multi-agent testing tool on PyPI |
| 1 | Timeout resilience test | Found 22 new issues in my system |
| 2 | JSON export | Another developer integrated it into his runtime gate within hours |
| 3 | LangGraph adapter | Now supports CrewAI + LangGraph |
| 4 | Sensitive data detection (23 patterns) | Catches AWS keys, JWT tokens, credit cards crossing agent boundaries |
| 5 | Per-agent health scores (0-100) | Know exactly which agent to fix first |
| 6 | Before/after comparison | Measure if your refactor actually improved reliability |
| 7 | ASCII agent graph | See your agent topology right in the terminal |
94 tests passing. Two frameworks supported. And growing.
The First Integration
Within 48 hours of launch, another developer built an integration. He has a runtime action-gate that blocks dangerous agent actions before execution. He connected swarm-test's findings as "priors" — so when swarm-test flags an edge as high-risk, his gate becomes more cautious on that edge.
The result: the same run_sql action went from "CONFIRM" (risk 62) to "HUMAN_REQUIRED" (risk 78) when swarm-test's cascade finding was attached.
Structural testing (swarm-test) + runtime enforcement (his gate) = the full reliability stack for multi-agent systems.
Why This Matters Now
According to recent industry research:
- 88% of organizations report AI agent security incidents
- Only 14.4% of agents go live with full security approval
- OWASP classified cascade failures as ASI08 — a top AI security risk
Multi-agent systems are going to production faster than anyone can secure them. The tools exist for single-agent monitoring. Nothing existed for multi-agent interaction testing — until now.
Try It
pip install swarm-test
from swarm_test import SwarmProbe
# Works with CrewAI
probe = SwarmProbe(your_crew)
report = probe.run_all()
report.print_summary()
report.to_html("report.html") # Interactive D3 graph
report.to_json("report.json") # Machine-readable for CI/CD
GitHub: github.com/surajkumar811/swarm-test
Open source. MIT licensed. Solo founder building in public.
What reliability tests would YOU want for your multi-agent systems? Drop a comment — I'm shipping features based on real feedback.













