How I Keep 31 AI Agents from Shipping Garbage

Bilal El Alamy · 9 min read
AI quality control · agent reliability · circuit breakers · multi-agent AI

Part 3 of 4. Part 1: Why 31 Agents · Part 2: Two-Layer Architecture · Part 3: Quality Control · Part 4: Costs & Memory


Part 1 covered why I need 31 agents. Part 2 covered the two-layer architecture that coordinates them. But architecture without quality control is just organized chaos. Here's how I keep 31 agents from shipping garbage.

The Problem Nobody Warns You About

The first time I let agents hand off work to each other without checks, the output looked fine on the surface. An agent researched a market. Another agent turned that research into a PRD. A third agent designed screens based on the PRD.

By the time the code was written, the original market insight had been distorted through three handoffs. Like a copy of a copy of a copy: each agent interpreted the previous output through its own context, and by the end, the product had drifted from the actual opportunity.

Context degradation across agent handoffs. That's the failure mode nobody talks about.

The fix wasn't better prompts. It was structural. Every handoff needed a quality gate. Every phase needed a human kill switch. Every agent needed to prove its work before claiming "done."

4 Human Gates. Zero Agent Self-Approval.

The entire product cycle runs through 7 phases and 4 human gates. Every gate is a kill switch.

GATE 0: Human provides idea or direction
    ↓
PHASE 1: Bill researches (market, competitors, validation)
    ↓
🚫 GATE 1: Human reviews research → GO / KILL
    ↓
PHASE 2: Vivi creates PRD + architecture (with Bill's inputs)
    ↓
🚫 GATE 2: Human reviews PRD → GO / KILL
    ↓
PHASE 3: Jack designs (wireframes, components, brand)
    ↓
🚫 GATE 3: Human reviews design → GO / KILL
    ↓
PHASE 4: Thom builds (frontend, backend, devops)
PHASE 5: Nico reviews code + Pepe tests
    ↓
🚫 GATE 4: Human reviews launch readiness → GO / KILL
    ↓
PHASE 6: Peiy launches (content, SEO, outreach)
PHASE 7: Vivi runs retrospective

No agent can approve moving to the next phase. Not Alfrawd. Not Vivi. Not anyone. I don't care how confident the agent is. I don't care if the research looks perfect. Human eyes at every gate, every time.

This isn't a trust issue. It's a structural guarantee. Agents execute. Humans decide.

The timing adapts to complexity. A simple product (clear market, familiar tech stack) moves through all 7 phases in 2 days. A complex product with novel architecture or regulatory requirements takes up to 2 weeks. For trivial tasks, prefix with quick: and the cycle collapses to research brief → build → ship, same day.

But regardless of timeline, the gates don't get skipped.
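The gate pattern above can be sketched in a few lines. This is an illustrative sketch, not the actual system: the phase outputs are stubbed strings, and the `approve` callback stands in for the human decision at each gate. The one structural point it captures is that approval is injected from outside the agent pipeline, so no agent can grant it to itself.

```python
# Hypothetical sketch of the 7-phase / 4-gate pipeline. Phase runners
# are stubbed; `approve` is the human's GO / KILL decision.
from typing import Callable

class KilledAtGate(Exception):
    """Raised when the human votes KILL at a gate."""

def human_gate(name: str, artifact: str, approve: Callable[[str], bool]) -> None:
    # The decision comes from outside the agent system; agents only
    # produce the artifact that gets reviewed here.
    if not approve(artifact):
        raise KilledAtGate(f"{name}: human voted KILL")

def run_cycle(approve: Callable[[str], bool]) -> str:
    research = "research brief"            # Phase 1 (Bill)
    human_gate("GATE 1", research, approve)
    prd = f"PRD from {research}"           # Phase 2 (Vivi)
    human_gate("GATE 2", prd, approve)
    design = f"design from {prd}"          # Phase 3 (Jack)
    human_gate("GATE 3", design, approve)
    build = f"build from {design}"         # Phases 4-5 (Thom, Nico, Pepe)
    human_gate("GATE 4", build, approve)
    return f"launched: {build}"            # Phases 6-7 (Peiy, Vivi)
```

A KILL at any gate aborts the whole cycle, which is the point: there is no code path from phase to phase that skips the human.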

Cascading Validation: Every Handoff Has a Quality Check

The human gates catch strategic drift. But there are dozens of agent-to-agent handoffs between gates. Each one is a point where context can degrade.

So the receiving agent validates every handoff before starting work.

Bill completes research
  → Vivi checks: Sources cited? Market data present?
      Competitor analysis included?
    → YES → Vivi starts PRD
    → NO  → Structured rejection with specific gaps

Jack completes design
  → Thom checks: Wireframes for all screens?
      Component specs? Responsive breakpoints?
    → YES → Thom spawns builders
    → NO  → Rejection with specific items missing

Thom completes code
  → Nico checks: Build passes? No critical TODOs?
      Tests present?
    → YES → Nico reviews
    → NO  → Rejection with specific failures

Nico approves code
  → Pepe checks: Review approved? No open revisions?
    → YES → Pepe runs test suite
    → NO  → Back to Nico

This is what kills the copy-of-a-copy problem. If Bill's research is missing competitor data, Vivi catches it before writing a PRD that silently ignores the competitive landscape. If Jack's wireframes don't cover mobile, Thom catches it before building a desktop-only app. The validation happens at the boundary, not after the damage propagates downstream.

Max 2 rejection loops. If an agent can't fix the issue after two attempts, it escalates: first to the orchestrator, then to Alfrawd, then to me. Two tries is enough to fix a genuine gap. If it takes three, the problem is usually a misunderstanding that needs human judgment, not another iteration.
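The two-strike rule can be sketched as a small loop. This is an assumed shape, not the actual code: `checklist` stands in for the receiving agent's validation (returning a list of specific gaps), and `redo` stands in for the sending agent's fix attempt.

```python
# Hypothetical sketch of handoff validation with the two-strike rule.
from typing import Callable

MAX_REJECTIONS = 2  # after two failed fixes, escalate to a human

def validate_handoff(artifact: dict,
                     checklist: Callable[[dict], list],
                     redo: Callable[[dict, list], dict]) -> dict:
    for attempt in range(MAX_REJECTIONS + 1):
        gaps = checklist(artifact)        # e.g. ["no competitor data"]
        if not gaps:
            return artifact               # receiving agent starts work
        if attempt == MAX_REJECTIONS:
            break                         # two fixes already failed
        # Structured rejection: send the specific gaps back, retry.
        artifact = redo(artifact, gaps)
    # Escalation path: orchestrator -> Alfrawd -> human.
    raise RuntimeError(f"escalate to orchestrator: {gaps}")
```

The key design choice is that the rejection is structured (a list of named gaps), so the sending agent fixes the actual deficiency instead of regenerating the whole artifact.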

Self-Testing Is Mandatory

"I'm done" is not accepted without proof.

Every agent verifies its own output before signaling completion. This isn't optional. It's baked into the task protocol.

Code agents (Thom's sub-agents) run build, lint, and test before marking a task complete. If the build fails, they fix it. If tests fail, they fix them. The orchestrator never sees "done" unless the automated checks passed.

Design agents (Jack's sub-agents) check that all screens in the spec are present, responsive variants exist, and the component library is consistent. Missing a mobile view? Not done.

Research agents (Bill) verify that every claim has a cited source, market sizing has methodology, and competitor analysis covers the defined set. "I think the TAM is $2B" without a source chain? Not done.

QA agents (Pepe's sub-agents) run the full test suite and produce a certification report. No report, no certification. No certification, no launch.

Self-testing catches the obvious failures before they waste anyone else's time. The cascading validation catches the subtle ones.
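For code agents, the "prove it before claiming done" rule reduces to a done-gate over the project's checks. A minimal sketch, assuming a Node-style project; the `npm` commands are placeholders for whatever build, lint, and test commands the project actually defines:

```python
# Hypothetical done-gate for a code agent: run every check, and only
# signal "done" if all of them exit cleanly.
import subprocess
from typing import Sequence

CHECKS: Sequence[Sequence[str]] = (
    ["npm", "run", "build"],   # placeholder commands; the real checks
    ["npm", "run", "lint"],    # depend on the project's toolchain
    ["npm", "test"],
)

def self_test(checks: Sequence[Sequence[str]] = CHECKS) -> bool:
    for cmd in checks:
        # A nonzero exit code means the agent keeps fixing, not reporting.
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return False
    return True

def mark_done(checks: Sequence[Sequence[str]] = CHECKS) -> str:
    return "done" if self_test(checks) else "not done"
```

The orchestrator only ever sees the return value of `mark_done`, so a red build or failing test can never be reported as completion.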

Trust Scoring: How Agents Earn (or Lose) Autonomy

Not all agents perform equally. Some are consistently reliable. Some need babysitting. The system tracks this.

Every agent has a trust score in the task registry:

```json
{
  "agentScores": {
    "thom-frontend": {
      "totalTasks": 47,
      "completed": 43,
      "failed": 4,
      "successRate": 0.91,
      "streakCurrent": 8,
      "byPhase": {
        "development": { "total": 38, "success": 36, "rate": 0.95 },
        "hotfix":      { "total": 9,  "success": 7,  "rate": 0.78 }
      }
    }
  }
}
```

Total tasks, success rate, current streak, broken down by phase. The numbers tell you exactly where an agent is strong and where it's shaky.

Trust scores map to oversight tiers:

| Score | Tier | Treatment |
|-------|------|-----------|
| > 0.9 | High trust | Delegate freely, minimal check-ins |
| 0.7–0.9 | Standard | Normal validation at every handoff |
| 0.5–0.7 | Low trust | Extra checks, consider rerouting |
| < 0.5 | Auto-reroute | Work automatically sent to fallback agent |

An agent at 0.95 has earned lighter oversight. An agent at 0.6 gets closer scrutiny. An agent below 0.5 doesn't get work at all; it's routed to a fallback automatically.

The byPhase breakdown matters. thom-frontend might be excellent at development (0.95) but noticeably shakier at hotfixes (0.78). The system knows the difference. Trust isn't binary; it's contextual.
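The bookkeeping behind those registry fields is simple enough to sketch. This is an assumed implementation (the real registry update code isn't shown in the series), mirroring the JSON fields and tier thresholds above:

```python
# Hypothetical per-phase trust scoring that maintains the registry
# fields shown above, plus the tier mapping from the table.

def tier(rate: float) -> str:
    if rate > 0.9:
        return "high-trust"     # delegate freely
    if rate >= 0.7:
        return "standard"       # normal validation
    if rate >= 0.5:
        return "low-trust"      # extra checks
    return "auto-reroute"       # work goes to the fallback agent

def record_task(score: dict, phase: str, success: bool) -> dict:
    score["totalTasks"] += 1
    score["completed" if success else "failed"] += 1
    score["successRate"] = score["completed"] / score["totalTasks"]
    score["streakCurrent"] = score["streakCurrent"] + 1 if success else 0
    # Per-phase stats: the same agent can sit in different tiers
    # for development vs. hotfix work.
    p = score["byPhase"].setdefault(phase, {"total": 0, "success": 0, "rate": 0.0})
    p["total"] += 1
    p["success"] += int(success)
    p["rate"] = p["success"] / p["total"]
    return score
```

Because `tier` can be applied to a phase-level rate as well as the overall rate, oversight can tighten for exactly the kind of work where the agent has a weak record.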

13 Fallback Pairs: Every Agent Has a Backup

Every agent in the system has a defined fallback. 13 fallback pairs across the entire operation. If thom-frontend is unreliable or its circuit breaker trips, UI work routes to thom-backend. If pepe-general can't run tests, pepe-web3 picks up.

These aren't perfect substitutes. A backend agent doing frontend work won't be as good as the specialist. But a slightly worse output is infinitely better than a stalled pipeline.

The fallback routing is automatic. When an agent drops below 0.5 trust or its circuit breaker opens, the orchestrator reroutes without waiting for human intervention. Speed matters more than optimality when the alternative is a blocked workflow.
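The routing decision itself is a one-liner in spirit. A minimal sketch, with an illustrative subset of the fallback table (only the two pairs named above; the other eleven aren't listed in this post):

```python
# Hypothetical automatic fallback routing: work goes to the primary
# agent unless its trust is below 0.5 or its circuit breaker is open.

FALLBACKS = {                          # illustrative subset of the 13 pairs
    "thom-frontend": "thom-backend",
    "pepe-general": "pepe-web3",
}

def route(agent: str, trust: dict, breaker_open: dict) -> str:
    healthy = (trust.get(agent, 1.0) >= 0.5
               and not breaker_open.get(agent, False))
    if healthy:
        return agent
    # Reroute immediately, without waiting for human intervention:
    # a blocked pipeline is worse than a slightly weaker specialist.
    return FALLBACKS.get(agent, agent)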

Circuit Breakers: Failing Fast Instead of Burning Tokens

Sometimes an agent doesn't just underperform; it breaks. A misconfigured tool, a rate-limited API, a corrupted context. When that happens, repeated retries don't help. They just burn tokens and time.

Circuit breakers stop the bleeding.

CLOSED (normal operation)
  │
  ├── Agent fails 3 times within 60 seconds
  │
  ▼
OPEN (no work sent to agent)
  │
  ├── 5-minute cooldown
  │
  ▼
HALF-OPEN (send one test task)
  │
  ├── Test passes → CLOSED (resume normal)
  └── Test fails  → OPEN (restart cooldown)

Three failures in 60 seconds trips the breaker. The agent stops receiving work for 5 minutes. After cooldown, it gets one test task. Pass? Back to normal. Fail? Another 5-minute cooldown.

Borrowed from microservices reliability engineering. The exact same pattern that keeps Netflix streaming when a service goes down. It works just as well for AI agents.

Without circuit breakers, a broken agent creates a cascading mess. The orchestrator keeps assigning work, the agent keeps failing, tokens keep burning, and downstream agents starve for input. The breaker isolates the failure in seconds.
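The state machine above fits in a short class. A sketch under the stated thresholds (3 failures in 60 seconds, 5-minute cooldown); the injectable `clock` is an assumption added to make the breaker testable:

```python
# Hypothetical circuit breaker implementing the CLOSED -> OPEN ->
# HALF-OPEN cycle described above.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, window=60.0, cooldown=300.0,
                 clock=time.monotonic):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.clock = clock
        self.failures = []          # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = None

    def allow(self) -> bool:
        """Can this agent receive work right now?"""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF-OPEN"   # cooldown over: allow one test task
        return self.state != "OPEN"

    def record(self, success: bool) -> None:
        now = self.clock()
        if success:
            if self.state == "HALF-OPEN":
                self.state = "CLOSED"  # test task passed: resume normal
            self.failures.clear()
            return
        if self.state == "HALF-OPEN":
            self.state, self.opened_at = "OPEN", now  # test failed: re-open
            return
        # Count only failures inside the rolling window.
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.state, self.opened_at = "OPEN", now
```

The orchestrator checks `allow()` before every assignment, so a broken agent is isolated after its third failure instead of burning tokens on retries.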

How It All Fits Together

These aren't independent systems. They're layers:

  1. Self-testing catches failures at the source before any handoff
  2. Cascading validation catches quality issues at every agent-to-agent boundary
  3. Human gates catch strategic drift at the 4 critical decision points
  4. Trust scoring adapts oversight based on proven reliability
  5. Fallback pairs keep work flowing when an agent is unreliable
  6. Circuit breakers isolate catastrophic failures instantly

It's not elegant. It's paranoid. That's why it works.

An agent that passes its own self-tests, survives cascading validation from the receiving agent, clears the human gate, and maintains a trust score above 0.9? That agent's output is reliable. Not because any single check is perfect, but because passing all of them is genuinely hard.

The system doesn't trust agents. It verifies them. Continuously, structurally, at every level.

The Uncomfortable Truth About AI Agent Quality Control

Most AI agent quality control advice is "write better prompts" or "add a reviewer agent." That's necessary but not sufficient. Prompts drift. Reviewer agents have their own failure modes. The only thing that reliably catches garbage is structural redundancy: multiple independent checks at multiple levels, with a human holding the kill switch at every strategic decision point.

31 agents producing unchecked output would ship garbage at scale. 31 agents running through this gauntlet ship work I can trust.


Quality systems keep the output reliable. But reliability means nothing if the system costs $6,000/month and forgets everything between sessions. Part 4 covers the economics: model tiering that cut costs 80%, the 5-layer memory architecture, and the nightly learning loop that makes the whole system structurally better every morning.
