How to Run 31 AI Agents in Production: The Architecture That Actually Works
A practitioner's guide to multi-agent orchestration: from glorified chatbots to a self-improving system.
You Don't Have an AI Agent Problem. You Have an Orchestration Problem.
Setting up one AI agent is easy. Give it a system prompt, connect some tools, and let it run. It'll work great. For about a week.
Then the cracks show:
- It makes the same mistake it made last Tuesday
- Two agents contradict each other and nobody notices
- A sub-agent says "done" but didn't actually verify anything
- You're burning $200/day on premium model calls that a cheaper model could handle
- You have no idea which agent is actually producing value
I know because every single one of these happened to us. We run 31 AI agents across 8 teams: product development, investment management, regulatory compliance, marketing, security, and infrastructure. All orchestrated through a single system.
Six months ago, they were glorified chatbots with fancy titles. Today, they ship production code, catch each other's mistakes, and get measurably better every night while we sleep.
Here's the architecture that made it work.
Layer 1: The Task Registry (Know What's Happening)
The first thing that breaks at scale is visibility. "What's happening right now?" becomes a surprisingly hard question when 31 agents are working simultaneously.
The fix: A live JSON registry (tasks.json) that tracks every active task across all agents: who's working on what, what phase it's in, what's blocked, what's done.
```json
{
  "taskId": "PRD-042",
  "agent": "vivi",
  "product": "AlphaNet",
  "phase": "design-handoff",
  "status": "blocked",
  "blockedBy": "jack: missing responsive breakpoints",
  "pr": null,
  "checks": { "codeReview": null, "qa": null }
}
```
Every agent reads and updates the registry on spawn and completion. During a product sprint, instead of asking four different agents what's blocking the launch, one query to the registry shows the answer: frontend is done, review is approved, but QA is blocked waiting for a smart contract patch.
Without the registry: four separate queries, four context switches, minutes of coordination overhead. With the registry: one query, an instant answer, unblocked in minutes.
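As a sketch of how an agent might query that registry, assuming `tasks.json` holds a flat array of task records shaped like the example above (the file layout is an assumption, not the production schema):

```python
import json
from pathlib import Path

def blocked_tasks(registry_path="tasks.json"):
    """Return every task currently blocked, with who or what is blocking it."""
    tasks = json.loads(Path(registry_path).read_text())
    return [
        {"taskId": t["taskId"], "agent": t["agent"], "blockedBy": t.get("blockedBy")}
        for t in tasks
        if t.get("status") == "blocked"
    ]
```

One function call replaces the four-agent status round-trip: the launch blocker is whatever this list contains.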
Layer 2: Cascading Validation (Stop Context Degradation)
In long chains of agent work, context degrades like a copy of a copy. By the fifth handoff, the original goal is distorted.
The fix: Every sub-task has typed expected inputs and expected outputs. Task N+1 cannot start until it validates that Task N's output matches what it needs. If invalid, it sends a structured rejection.
The upstream validation chain:
- Research → PRD: must have sources, market data, competitor analysis
- PRD → Design: must have features, user stories, acceptance criteria
- Design → Code: must have wireframes, component specs, responsive breakpoints
- Code → Review: must have PR, clean build, no critical TODOs
- Review → QA: must have approval, no open revision requests
- QA → Launch: must have certification, zero critical/major bugs
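The chain above reduces to a simple handoff gate: check the upstream output against the downstream phase's requirements, and emit a structured rejection when fields are missing. The phase and field names below are illustrative, not the production schema:

```python
# Required output fields per upstream phase (illustrative names).
REQUIRED_OUTPUTS = {
    "research": {"sources", "market_data", "competitor_analysis"},
    "prd": {"features", "user_stories", "acceptance_criteria"},
    "design": {"wireframes", "component_specs", "responsive_breakpoints"},
}

def validate_handoff(phase, output):
    """Validate an upstream phase's output before the next task starts.
    Returns (ok, rejection); rejection names the missing fields."""
    missing = REQUIRED_OUTPUTS[phase] - set(output)
    if missing:
        return False, {"rejected": phase, "missing": sorted(missing)}
    return True, None
```

The structured rejection is the point: "Missing: responsive_breakpoints" is actionable at handoff time, three hours before a frontend agent would have tripped over it mid-build.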
Real impact: Our design agent produced a component library but forgot mobile responsive specs. Instead of the frontend agent discovering this mid-build (3 hours wasted), cascading validation caught it at handoff: "Missing: responsive breakpoints for screens <768px." Fixed in 10 minutes. Total time saved: roughly 3 hours plus the debugging that would have followed.
Layer 3: Circuit Breakers (Isolate Failures)
Without circuit breakers, a broken agent keeps receiving work, keeps failing, and keeps consuming tokens. Worse, it cascades: broken code → broken review → blocked QA → launch delay.
The fix: If an agent fails repeatedly within a short window, the system stops sending it work.
Three states:
- Closed (normal operation): work flows through
- Open (tripped): all new tasks fail fast, agent enters cooldown
- Half-open (recovery): one test task. Pass = resume. Fail = re-open.
Thresholds vary by agent type: code agents trip after 3 failures in 60 seconds. Finance agents trip after 2 (higher sensitivity for financial operations).
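A minimal version of that three-state machine might look like the sketch below. The thresholds and cooldown are illustrative defaults, not the production values:

```python
import time

class CircuitBreaker:
    """Three-state breaker: closed -> open after repeated failures in a
    window -> half-open after cooldown, where one test task decides."""

    def __init__(self, max_failures=3, window=60, cooldown=300):
        self.max_failures, self.window, self.cooldown = max_failures, window, cooldown
        self.failures = []      # timestamps of recent failures
        self.opened_at = None
        self.state = "closed"

    def allow(self, now=None):
        """May this agent receive work right now?"""
        now = now if now is not None else time.time()
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                self.state = "half-open"   # let one test task through
                return True
            return False                   # fail fast during cooldown
        return True

    def record(self, success, now=None):
        """Report the outcome of a task given to this agent."""
        now = now if now is not None else time.time()
        if success:
            self.failures, self.state, self.opened_at = [], "closed", None
            return
        if self.state == "half-open":      # test task failed: re-open
            self.state, self.opened_at = "open", now
            return
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        if len(self.failures) >= self.max_failures:
            self.state, self.opened_at = "open", now
```

For a finance agent you would instantiate with `max_failures=2`; everything else stays the same.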
Real impact: An API rate limit caused our frontend agent to fail three consecutive build tasks. Circuit breaker tripped after the third failure, waited 5 minutes (rate limit reset), then successfully completed a test task and resumed normal operation. Total wasted work: 3 tasks. Without the breaker: potentially dozens, plus cascading failures downstream.
Layer 4: Model Differentiation (Stop Overpaying)
Running 31 agents on the most expensive model is financial suicide. But downgrading everything to a cheap model tanks quality.
The fix: Match model capability to task requirement.
| Tier | Model | Agent Count | Use Case |
|---|---|---|---|
| Premium | Claude Opus 4.6 (1M context) | 22 | Orchestrators, complex reasoning, cross-domain synthesis |
| Standard | Claude Sonnet 4.6 (200K) | 8 | Execution sub-agents (content, SEO, QA) |
| Specialized | Grok fast/code (131K) | 4 | Investment data feeds, code review |
Our SEO agent was running on Opus at ~$2 per analysis. Switching to Sonnet dropped that to ~$0.40 with zero measurable quality difference: SEO analysis is structured and rule-based, not requiring deep reasoning.
Across 8 execution sub-agents running multiple tasks daily, this saves hundreds per month while orchestrators keep their full reasoning power.
The key insight: An SEO writer doesn't need the same reasoning power as a regulatory compliance analyst. Match the tool to the job.
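A routing table in this spirit can be as simple as a dictionary lookup. The tier names follow the table above, but the model IDs and agent names are placeholders, not a real API:

```python
# Illustrative tier -> model mapping (model IDs and agent names are made up).
MODEL_TIERS = {
    "specialized": {"model": "grok-code",     "agents": {"code-review", "market-data"}},
    "standard":    {"model": "claude-sonnet", "agents": {"seo", "content", "qa"}},
    "premium":     {"model": "claude-opus",   "agents": {"orchestrator", "compliance"}},
}

def model_for(agent):
    """Pick the cheapest adequate tier; unknown agents default to premium."""
    for tier in ("specialized", "standard"):
        if agent in MODEL_TIERS[tier]["agents"]:
            return MODEL_TIERS[tier]["model"]
    return MODEL_TIERS["premium"]["model"]
```

Defaulting unknown agents upward is deliberate: over-spending on an unclassified agent is cheaper than silently degrading a compliance analysis.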
Layer 5: Self-Test Rules (Make "Done" Mean Something)
"I'm done" without proof is worthless. A code agent that says "done" but didn't run the build is lying. A research agent that says "done" but cited zero sources is guessing.
The fix: Every agent type has explicit self-verification criteria:
| Agent Type | Must Verify Before "Done" |
|---|---|
| Code agents | Build passes, lint clean, tests pass |
| Design agents | All screens present, responsive specs, component list |
| Research agents | Sources cited, competitors β₯3, market size estimated |
| QA agents | Test plan executed, all critical paths, bug list with severity |
| Marketing agents | Copy proofread, links valid, CTA present |
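For a code agent, the "done" gate can be sketched as running each required command and refusing the done status on the first non-zero exit. The `make` targets are illustrative, not the actual build commands:

```python
import subprocess

# Self-test commands per agent type (targets are illustrative).
SELF_TESTS = {
    "code": [["make", "build"], ["make", "lint"], ["make", "test"]],
}

def verify_done(agent_type, runner=subprocess.run):
    """Run every self-test for the agent type.
    Returns (ok, failed_command); "done" is only valid when ok is True."""
    for cmd in SELF_TESTS.get(agent_type, []):
        result = runner(cmd, capture_output=True)
        if result.returncode != 0:
            return False, cmd
    return True, None
```

The `runner` parameter exists so the gate itself is testable; in production it is just `subprocess.run`.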
Real impact: Our backend agent marked a Django API task as "done." Self-test rules required running the test suite. The suite revealed a failing authentication test that would have broken the login flow in production. Caught in self-test, not in QA. Not in production.
Layer 6: Proactive Monitoring (Don't Wait to Be Asked)
Reactive agents only work when you remember to ask them. That's not a team; it's a set of tools you have to pick up manually.
The fix: 10 active cron jobs that scan for issues, surface alerts, and trigger work automatically.
| Job | Schedule | What It Does |
|---|---|---|
| Morning Briefing | 6am weekdays | Disk + calendar + email scan |
| Email Triage | 4x daily | Urgent email detection |
| Portfolio Alert | 8am/8pm weekdays | Market movers |
| Disk Monitor | 2x daily | Alert if >80% usage |
| Nightly Extraction | 11pm daily | Learning loop + memory maintenance |
| Weekly Retrospective | Friday 5pm | Cycle summary and learnings |
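The disk-monitor job, for example, reduces to a few lines. The 80% threshold comes from the table above; how the alert is delivered is left out:

```python
import shutil

def disk_alert(path="/", threshold=0.80):
    """Return an alert string when disk usage crosses the threshold,
    None otherwise. Delivery (message, email) is handled elsewhere."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction > threshold:
        return f"ALERT: {path} at {used_fraction:.0%} (threshold {threshold:.0%})"
    return None
```

Wired into cron twice a day, a `None` return means silence; anything else becomes a proactive message instead of a discovery during an outage.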
Real impact: The email triage caught a compliance email from our bank requiring urgent action on a €420K capital deposit: origin documents were needed for a regulatory deadline. Without the triage, this email would have sat unread. The agent flagged it, and the human acted the same morning.
Layer 7: The Agent Performance Scorecard (Measure Everything)
You can't improve what you can't measure.
The fix: An automated weekly report measuring every agent's performance: success rate, failure rate, consecutive errors, average token usage, average runtime, estimated cost per successful task β broken down by model tier.
```bash
python3 scripts/agent-scorecard.py generate --days 7
```
Output: per-agent performance tables, model efficiency comparisons, downgrade candidates, broken agents, idle agents, and cron coverage gaps.
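A toy version of the aggregation behind such a report might look like the sketch below. The record fields are illustrative assumptions about what each run log contains:

```python
from collections import defaultdict

def scorecard(runs):
    """Aggregate raw run records into per-agent stats.
    Each run is a dict like {"agent": "vivi", "ok": True, "tokens": 1200};
    the field names are illustrative."""
    totals = defaultdict(lambda: {"runs": 0, "ok": 0, "tokens": 0})
    for run in runs:
        agent = run.get("agent", "unknown")   # unattributed runs land here
        t = totals[agent]
        t["runs"] += 1
        t["ok"] += int(run["ok"])
        t["tokens"] += run.get("tokens", 0)
    return {
        agent: {"success_rate": t["ok"] / t["runs"],
                "avg_tokens": t["tokens"] / t["runs"]}
        for agent, t in totals.items()
    }
```

Note the "unknown" bucket: any run whose metadata fails to name an agent ends up there, which makes attribution gaps visible instead of invisible.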
Real impact: The first scorecard revealed that 272 out of 298 cron runs couldn't be attributed to specific agents β the "unknown" bucket. This wasn't a scoring problem; it was a metadata problem. Agent names weren't being propagated through cron job labels. One config fix later, agent attribution jumped to 90%+. The scorecard exposed a data quality issue nobody knew existed.
The Compound Effect
Each layer works independently. Together, they create something qualitatively different: a system that learns from failures (regressions list), catches contradictions (friction detection), validates its own work (cascading validation), prevents duplicate actions (idempotency keys), isolates failures (circuit breakers), measures its own performance (scorecard), and improves every night (nightly extraction).
The question isn't "how smart is your agent?" It's "how fast is your agent learning?"
In six months, the system with learning loops will be unrecognizably better than the one without β regardless of where they started.
What We Built From This
Every improvement above was born from a real failure running 31 agents for our AI startup studio. We took these lessons and built Mr.Chief β an AI Chief of Staff that lives in your messages. All of this architecture, packaged so nobody else has to implement 47 improvements themselves.
No server to manage. No dashboard to learn. Just open WhatsApp and start delegating.
Because the best AI architecture is the one you don't have to build.
This article is part of a series on production AI agent architecture. See also: How to Secure 31 AI Agents Without Lobotomizing Them, Why Your AI Agent Has Amnesia, The Real Cost of Running AI Agents.