How do you build AI agents that are reliable, observable, and composable in production?
Most AI agent demonstrations look compelling and break under real workloads. The gap between a convincing demo and a production-grade agent system is substantial — and the failure modes are not random. They are predictable, patterned, and preventable if you design for them from the start.
This research stream grew out of our work building FabricxAI (22 coordinating agents) and NEXUS ARIA (12 agents). We hit every failure mode you can imagine: agents that lost context mid-task, tool calls that failed silently, orchestrators that deadlocked, memory that corrupted across sessions. We documented every incident and derived patterns from the wreckage.
The core questions we investigate: Which coordination topologies work at scale and which collapse? Where must human oversight sit to be effective without being a bottleneck? How do you build memory architecture that keeps agents coherent across hour-long tasks? And how do you design failure recovery that degrades gracefully instead of catastrophically?
Everything we learn goes into sociofi-agent-kit and review-gate — the open-source libraries that came out of this stream — and into the architecture of every agent system we build in the Products and Agents divisions.
After testing peer-to-peer, hierarchical, and market-based coordination topologies across 12 agent systems of varying sizes, orchestrator-worker proved the only topology that maintained reliability past 5 agents. Peer-to-peer coordination produces communication overhead that grows quadratically with agent count, since every pair of agents is a potential channel; market-based systems introduce auction latency that compounds under load.
We ported the same task (a 22-step quality control workflow) to each topology and ran it 500 times each, measuring completion rate, latency, and error propagation. Results were consistent across three separate system deployments.
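To make the overhead argument concrete, here is a minimal sketch (function and variable names are illustrative, not from our libraries): the channel count of a fully connected peer mesh versus a single coordinator fanning work out to stateless workers.

```python
import concurrent.futures

def peer_channels(n: int) -> int:
    # Fully connected peer-to-peer mesh: every agent pair is a
    # potential communication channel, so n * (n - 1) / 2 channels.
    return n * (n - 1) // 2

def orchestrate(task_steps, worker_fn, max_workers=4):
    """Minimal orchestrator-worker loop: one coordinator fans steps
    out to workers and collects results in order. Channels grow
    linearly, one per worker, regardless of agent count."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker_fn, task_steps))

# A 5-agent peer mesh already has 10 channels; at 22 agents it is 231.
# Orchestrator-worker at 22 agents still has only 22 channels.
```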
Agents using only conversational context for memory began exhibiting coherence failures around the 30-minute mark. They would contradict earlier decisions, repeat completed steps, or lose track of constraints established at the start of the task. Agents with explicit memory architecture (structured state objects persisted outside the context window) maintained coherence for multi-hour tasks without degradation.
We ran 80 agent tasks of varying duration with and without explicit memory architecture, rating coherence at 10-minute intervals using a rubric scored by two independent reviewers. Coherence failures were defined as contradictions of earlier decisions or repetition of completed steps.
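A structured state object of the kind described above can be sketched as follows. This is a minimal illustration under assumed field names (`constraints`, `decisions`, `completed_steps`), not the schema our systems actually use: the point is that the agent re-reads persisted state rather than relying on conversational recall.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class TaskState:
    """State persisted outside the context window. The agent reloads
    this at each step, so earlier decisions, constraints, and completed
    steps survive however long the conversation runs."""
    task_id: str
    constraints: list = field(default_factory=list)
    decisions: dict = field(default_factory=dict)       # decision -> outcome
    completed_steps: list = field(default_factory=list)

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "TaskState":
        return cls(**json.loads(path.read_text()))
```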
Adding oversight gates without careful placement created bottlenecks that nullified the efficiency gains of automation. Gates positioned immediately before destructive or irreversible actions (database writes, API calls that trigger external effects), with a 15-minute async approval window, maintained safety with minimal throughput impact. Gating every step increased end-to-end task time by 340% with no measurable safety improvement over selective gating.
We tested four gating strategies on the same FabricxAI deployment over 90-day periods each, measuring throughput, critical failure rate, and human reviewer approval-to-rejection ratio.
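The gate pattern can be sketched as below. This is a simplified single-action illustration under assumed names (`ApprovalGate`, `run_gated`), not our production implementation: the irreversible action blocks for up to the approval window, and on timeout or rejection it simply never runs.

```python
import queue
import time

class ApprovalGate:
    """Async oversight gate: an irreversible action waits up to
    `window_s` (e.g. 15 minutes) for a human decision, and defaults
    to deny if none arrives."""
    def __init__(self, window_s: float = 15 * 60):
        self.window_s = window_s
        self._decisions = queue.Queue()  # (action_id, approved) pairs

    def decide(self, action_id: str, approved: bool) -> None:
        # Called by the human reviewer, possibly from another thread.
        self._decisions.put((action_id, approved))

    def run_gated(self, action_id: str, action_fn):
        deadline = time.monotonic() + self.window_s
        while time.monotonic() < deadline:
            remaining = deadline - time.monotonic()
            try:
                aid, ok = self._decisions.get(timeout=max(0.0, remaining))
            except queue.Empty:
                break  # approval window expired
            if aid == action_id:
                return action_fn() if ok else None
            # Decision for a different action: ignored in this sketch.
        return None  # timed out or rejected: the action never runs
```

Default-deny on timeout is the design choice that keeps the gate safe: a stalled reviewer delays work but never silently authorizes a destructive action.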
Agent systems built without explicit failure handling failed catastrophically, not gracefully, when any component broke. Error propagation was non-linear: one agent failure in a 10-agent chain typically caused 3-4 additional failures within 2 minutes. Systems with explicitly designed fallback chains isolated failures at the point of origin in 89% of cases.
We introduced controlled single-agent failures into production systems with and without explicit failure architecture, measuring how many downstream agents were affected and how long before human intervention was required.
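A fallback chain in its simplest form looks like this (an illustrative sketch, not the pattern's exact implementation in our libraries): each alternative is tried in order, exceptions are contained at the failing agent, and only a single summarized failure surfaces if every option is exhausted.

```python
def run_with_fallbacks(step_fn, fallbacks, *args):
    """Try the primary agent, then each fallback in order. Failures
    are recorded and contained rather than propagating raw exceptions
    to downstream agents."""
    errors = []
    for fn in [step_fn, *fallbacks]:
        try:
            return fn(*args)
        except Exception as exc:
            errors.append(f"{fn.__name__}: {exc}")
    # All options exhausted: surface one contained failure for human
    # intervention instead of letting it cascade down the chain.
    raise RuntimeError("; ".join(errors))
```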
We run adversarial tests against real production agent systems, not synthetic benchmarks. Every experiment uses actual workloads from Studio projects or our own internal infrastructure.
We measure task completion rate, error propagation rate, mean-time-to-recovery, and human intervention frequency. All metrics are compared against a baseline run without the experimental change.
Every failed experiment is documented at the same level of detail as successes. We include what we expected, what we observed, and what we changed because of it. Failures often contain more signal than successes.
Hypothesis: Routing complex reasoning tasks to larger models and simple tasks to smaller, cheaper models should reduce cost without measurable quality loss.
40% cost reduction with no measurable quality loss across 500 task runs.
Complexity routing is now a first-class architectural concern. Routing logic based on task type (reasoning vs. retrieval vs. generation) outperformed routing based on input length alone.
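A minimal version of task-type routing looks like the sketch below. The model names and the mapping are hypothetical placeholders, not our actual routing table:

```python
# Hypothetical tiers; real routing maps task types to your providers.
MODEL_BY_TASK = {
    "reasoning": "large-model",   # multi-step planning, analysis
    "retrieval": "small-model",   # lookup, extraction
    "generation": "small-model",  # templated or boilerplate output
}

def route(task_type: str, default: str = "large-model") -> str:
    """Route by task type rather than input length: a short prompt can
    still need deep reasoning, and a long one can be simple retrieval.
    Unknown types fall back to the larger model for safety."""
    return MODEL_BY_TASK.get(task_type, default)
```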
Hypothesis: An AI agent with access to linting, static analysis, and semantic review tools can replace human code review for production code.
The agent caught 60% of issues but missed every critical security vulnerability in the test set.
The 40% it missed included all 12 security vulnerabilities seeded into the test corpus. AI code review is a powerful first-pass tool but cannot replace human security review. We now position it as a complement, not a replacement.
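The complement-not-replacement positioning can be expressed as a triage rule, sketched below under assumed finding fields (`category`, `confidence`): the agent auto-handles routine lint and style findings, while anything security-tagged, or anything it is unsure about, always escalates to a human reviewer.

```python
def triage(findings):
    """First-pass triage of agent code-review findings. Security
    findings and low-confidence findings never bypass human review."""
    auto, escalate = [], []
    for f in findings:
        if f.get("category") == "security" or f.get("confidence", 0.0) < 0.8:
            escalate.append(f)
        else:
            auto.append(f)
    return auto, escalate
```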
Hypothesis: Five coordinating agents can reliably execute a 22-step quality control workflow with less than 2% error rate.
Achieved 97.3% task completion rate across 300 production runs in the first month.
Multi-agent coordination works in production at this scale. The orchestrator-worker pattern proved reliable. The system was later expanded to 22 agents following the same architectural template.