How do you build AI agents that are reliable, observable, and composable in production?
Most AI agent demonstrations look compelling and break under real workloads. The gap between a convincing demo and a production-grade agent system is substantial — and the failure modes are not random. They are predictable, patterned, and preventable if you design for them from the start.
This research stream grew out of our work building FabricxAI (22 coordinating agents) and NEXUS ARIA (12 agents). We hit every failure mode you can imagine: agents that lost context mid-task, tool calls that failed silently, orchestrators that deadlocked, memory that corrupted across sessions. We documented every incident and derived patterns from the wreckage.
The core questions we investigate: Which coordination topologies work at scale and which collapse? Where must human oversight sit to be effective without being a bottleneck? How do you build memory architecture that keeps agents coherent across hour-long tasks? And how do you design failure recovery that degrades gracefully instead of catastrophically?
Everything we learn goes into sociofi-agent-kit and review-gate — the open-source libraries that came out of this stream — and into the architecture of every agent system we build in the Products and Agents divisions.
After testing peer-to-peer, hierarchical, and market-based coordination topologies across 12 agent systems of varying sizes, orchestrator-worker proved the only topology that maintained reliability past 5 agents. Peer-to-peer coordination produces communication overhead that grows quadratically with agent count, since every pair of agents is a potential channel; market-based systems introduce auction latency that compounds under load.
We ported the same task (a 22-step quality control workflow) to each topology and ran it 500 times each, measuring completion rate, latency, and error propagation. Results were consistent across three separate system deployments.
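To make the overhead argument concrete, here is a minimal sketch (function and variable names are illustrative, not from our libraries): the channel count of a fully connected peer mesh versus a single coordinator fanning work out to stateless workers.

```python
import concurrent.futures

def peer_channels(n: int) -> int:
    # Fully connected peer-to-peer mesh: every agent pair is a
    # potential communication channel, so n * (n - 1) / 2 channels.
    return n * (n - 1) // 2

def orchestrate(task_steps, worker_fn, max_workers=4):
    """Minimal orchestrator-worker loop: one coordinator fans steps
    out to workers and collects results in order. Channels grow
    linearly, one per worker, regardless of agent count."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker_fn, task_steps))

# A 5-agent peer mesh already has 10 channels; at 22 agents it is 231.
# Orchestrator-worker at 22 agents still has only 22 channels.
```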
Agents using only conversational context for memory began exhibiting coherence failures around the 30-minute mark. They would contradict earlier decisions, repeat completed steps, or lose track of constraints established at the start of the task. Agents with explicit memory architecture (structured state objects persisted outside the context window) maintained coherence for multi-hour tasks without degradation.
We ran 80 agent tasks of varying duration with and without explicit memory architecture, rating coherence at 10-minute intervals using a rubric scored by two independent reviewers. Coherence failures were defined as contradictions of earlier decisions or repetition of completed steps.
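A structured state object of the kind described above can be sketched as follows. This is a minimal illustration under assumed field names (`constraints`, `decisions`, `completed_steps`), not the schema our systems actually use: the point is that the agent re-reads persisted state rather than relying on conversational recall.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class TaskState:
    """State persisted outside the context window. The agent reloads
    this at each step, so earlier decisions, constraints, and completed
    steps survive however long the conversation runs."""
    task_id: str
    constraints: list = field(default_factory=list)
    decisions: dict = field(default_factory=dict)       # decision -> outcome
    completed_steps: list = field(default_factory=list)

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "TaskState":
        return cls(**json.loads(path.read_text()))
```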
Adding oversight gates without careful placement created bottlenecks that nullified the efficiency gains of automation. Gates positioned immediately before destructive or irreversible actions (database writes, API calls that trigger external effects), with a 15-minute async approval window, maintained safety with minimal throughput impact. Gating every step increased end-to-end task time by 340% with no measurable safety improvement over selective gating.
We tested four gating strategies on the same FabricxAI deployment over 90-day periods each, measuring throughput, critical failure rate, and human reviewer approval-to-rejection ratio.
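The gate pattern can be sketched as below. This is a simplified single-action illustration under assumed names (`ApprovalGate`, `run_gated`), not our production implementation: the irreversible action blocks for up to the approval window, and on timeout or rejection it simply never runs.

```python
import queue
import time

class ApprovalGate:
    """Async oversight gate: an irreversible action waits up to
    `window_s` (e.g. 15 minutes) for a human decision, and defaults
    to deny if none arrives."""
    def __init__(self, window_s: float = 15 * 60):
        self.window_s = window_s
        self._decisions = queue.Queue()  # (action_id, approved) pairs

    def decide(self, action_id: str, approved: bool) -> None:
        # Called by the human reviewer, possibly from another thread.
        self._decisions.put((action_id, approved))

    def run_gated(self, action_id: str, action_fn):
        deadline = time.monotonic() + self.window_s
        while time.monotonic() < deadline:
            remaining = deadline - time.monotonic()
            try:
                aid, ok = self._decisions.get(timeout=max(0.0, remaining))
            except queue.Empty:
                break  # approval window expired
            if aid == action_id:
                return action_fn() if ok else None
            # Decision for a different action: ignored in this sketch.
        return None  # timed out or rejected: the action never runs
```

Default-deny on timeout is the design choice that keeps the gate safe: a stalled reviewer delays work but never silently authorizes a destructive action.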
Agent systems built without explicit failure handling failed catastrophically, not gracefully, when any component broke. Error propagation was non-linear: one agent failure in a 10-agent chain typically caused 3-4 additional failures within 2 minutes. Systems with explicitly designed fallback chains isolated failures at the point of origin in 89% of cases.
We introduced controlled single-agent failures into production systems with and without explicit failure architecture, measuring how many downstream agents were affected and how long before human intervention was required.
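A fallback chain in its simplest form looks like this (an illustrative sketch, not the pattern's exact implementation in our libraries): each alternative is tried in order, exceptions are contained at the failing agent, and only a single summarized failure surfaces if every option is exhausted.

```python
def run_with_fallbacks(step_fn, fallbacks, *args):
    """Try the primary agent, then each fallback in order. Failures
    are recorded and contained rather than propagating raw exceptions
    to downstream agents."""
    errors = []
    for fn in [step_fn, *fallbacks]:
        try:
            return fn(*args)
        except Exception as exc:
            errors.append(f"{fn.__name__}: {exc}")
    # All options exhausted: surface one contained failure for human
    # intervention instead of letting it cascade down the chain.
    raise RuntimeError("; ".join(errors))
```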
We run adversarial tests against real production agent systems, not synthetic benchmarks. Every experiment uses actual workloads from Studio projects or our own internal infrastructure.
We measure task completion rate, error propagation rate, mean-time-to-recovery, and human intervention frequency. All metrics are compared against a baseline run without the experimental change.
Every failed experiment is documented at the same level of detail as successes. We include what we expected, what we observed, and what we changed because of it. Failures often contain more signal than successes.
Hypothesis: Routing complex reasoning tasks to larger models and simple tasks to smaller, cheaper models should reduce cost without measurable quality loss.
40% cost reduction with no measurable quality loss across 500 task runs.
Complexity routing is now a first-class architectural concern. Routing logic based on task type (reasoning vs. retrieval vs. generation) outperformed routing based on input length alone.
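A minimal version of task-type routing looks like the sketch below. The model names and the mapping are hypothetical placeholders, not our actual routing table:

```python
# Hypothetical tiers; real routing maps task types to your providers.
MODEL_BY_TASK = {
    "reasoning": "large-model",   # multi-step planning, analysis
    "retrieval": "small-model",   # lookup, extraction
    "generation": "small-model",  # templated or boilerplate output
}

def route(task_type: str, default: str = "large-model") -> str:
    """Route by task type rather than input length: a short prompt can
    still need deep reasoning, and a long one can be simple retrieval.
    Unknown types fall back to the larger model for safety."""
    return MODEL_BY_TASK.get(task_type, default)
```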
Hypothesis: An AI agent with access to linting, static analysis, and semantic review tools can replace human code review for production code.
The agent caught 60% of issues but missed every critical security vulnerability in the test set.
The 40% it missed included all 12 security vulnerabilities seeded into the test corpus. AI code review is a powerful first-pass tool but cannot replace human security review. We now position it as a complement, not a replacement.
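The complement-not-replacement positioning can be expressed as a triage rule, sketched below under assumed finding fields (`category`, `confidence`): the agent auto-handles routine lint and style findings, while anything security-tagged, or anything it is unsure about, always escalates to a human reviewer.

```python
def triage(findings):
    """First-pass triage of agent code-review findings. Security
    findings and low-confidence findings never bypass human review."""
    auto, escalate = [], []
    for f in findings:
        if f.get("category") == "security" or f.get("confidence", 0.0) < 0.8:
            escalate.append(f)
        else:
            auto.append(f)
    return auto, escalate
```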
Hypothesis: Five coordinating agents can reliably execute a 22-step quality control workflow with less than 2% error rate.
Achieved 97.3% task completion rate across 300 production runs in the first month.
Multi-agent coordination works in production at this scale. The orchestrator-worker pattern proved reliable. The system was later expanded to 22 agents following the same architectural template.