[Diagram: the four orchestration patterns. Sequential: A → B → C. Fan-out: root → w1, w2, w3. Supervisor: orchestrator → exec-1, exec-2, exec-3. State machine: IDLE → RUN loop.]

Agent demos are easy to build. You spin up a model with a few tools, give it a task, and it produces something impressive. Then you try to put it in production with real data, real edge cases, and real users depending on it — and you discover that "impressive in demo" and "reliable in production" are separated by a significant amount of engineering work.

The most important part of that engineering work is orchestration: how you structure the coordination between agents, how you handle failures, how you manage shared state, and how you ensure that human review happens at the right points. We have built production multi-agent systems with between 8 and 22 agents. Here is what we have learned about the patterns that work.

Pattern 1: Sequential pipelines

The sequential pipeline is the simplest pattern and the right choice more often than people expect. Agent A completes its work, passes structured output to Agent B, which completes its work, passes structured output to Agent C, and so on. No agent runs until the previous one finishes. The output of each stage is the complete, structured input for the next.

When to use it: whenever the work is genuinely sequential — where each stage depends on the output of the previous one and cannot start before it completes. Document processing, code generation pipelines, and content production workflows are all naturally sequential.

The key discipline in sequential pipelines is output structure. If Agent A produces unstructured text that Agent B has to parse and interpret, you have introduced a reliability failure point. If Agent A produces a structured object that matches exactly what Agent B expects, the pipeline is reliable. Design the interface between agents before you design the agents.
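To make that concrete, here is a minimal sketch in Python of typed stage contracts for a hypothetical document-processing pipeline. The dataclass fields and agent functions are illustrative stand-ins for real model calls; the point is that each agent's output type is exactly the next agent's input type.

```python
from dataclasses import dataclass

# Hypothetical stage contracts: each agent's output type is the
# next agent's input type, designed before the agents themselves.

@dataclass
class RawDocument:
    source_id: str
    text: str

@dataclass
class ExtractedFacts:
    source_id: str
    facts: list[str]

@dataclass
class Summary:
    source_id: str
    summary: str

def agent_a(doc: RawDocument) -> ExtractedFacts:
    """Extract candidate facts (stubbed; a model call in practice)."""
    facts = [line for line in doc.text.splitlines() if line.strip()]
    return ExtractedFacts(source_id=doc.source_id, facts=facts)

def agent_b(facts: ExtractedFacts) -> Summary:
    """Summarise the extracted facts (stubbed; a model call in practice)."""
    return Summary(source_id=facts.source_id, summary="; ".join(facts.facts))

def run_pipeline(doc: RawDocument) -> Summary:
    # No stage starts until the previous one has produced its full,
    # typed output, so every intermediate value can be logged and inspected.
    return agent_b(agent_a(doc))
```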

Sequential pipelines are also the easiest to debug. When something goes wrong, you can inspect every stage's input and output and find exactly where the failure occurred. This debuggability is worth prioritising, especially in production systems that need to be maintained over time.

Pattern 2: Parallel fan-out

Fan-out is appropriate when you have a set of independent tasks that can run simultaneously and whose results need to be aggregated. A root agent receives the initial input, decomposes it into parallel work items, dispatches those items to worker agents simultaneously, and then aggregates the results when all workers complete.

When to use it: when the tasks are genuinely independent — where the workers do not need each other's output to complete their work — and when the latency savings from parallelism are worth the coordination overhead.

The failure mode to watch for is false independence. Tasks that look independent often have implicit dependencies that only appear under certain conditions. Two agents writing to the same output structure without coordination will produce conflicts. Two agents querying the same external API simultaneously may hit rate limits. Map the actual dependencies before deciding that fan-out is appropriate.

Aggregation is harder than it looks. Collecting results from five parallel agents and producing a coherent, deduplicated, prioritised output is a non-trivial task. Design the aggregation step explicitly — it is usually more complex than the worker steps.
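As an illustration, here is a minimal asyncio sketch of the fan-out-and-aggregate shape. The worker and the deduplication rule are hypothetical; what matters is that aggregation is its own explicit, designed step.

```python
import asyncio

async def worker(item: str) -> list[str]:
    """Hypothetical worker; in practice this wraps a model or tool call."""
    await asyncio.sleep(0)  # stand-in for real I/O
    return [f"finding about {item}"]

async def fan_out(items: list[str]) -> list[str]:
    # Dispatch all independent work items at once. return_exceptions=True
    # keeps one failed worker from discarding everyone else's results.
    results = await asyncio.gather(*(worker(i) for i in items),
                                   return_exceptions=True)
    return aggregate(results)

def aggregate(results) -> list[str]:
    # The explicit aggregation step: skip failures, flatten,
    # and deduplicate while preserving order.
    seen, merged = set(), []
    for r in results:
        if isinstance(r, Exception):
            continue  # in production: log and decide whether to retry
        for finding in r:
            if finding not in seen:
                seen.add(finding)
                merged.append(finding)
    return merged

# asyncio.run(fan_out(["pricing", "competitors", "pricing"]))
```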

Pattern 3: Supervisor/worker hierarchy

In the supervisor/worker pattern, a supervisor agent maintains the overall task context and delegates specific sub-tasks to specialist worker agents. The supervisor receives results, evaluates them against the task context, and either accepts them, requests revision, or delegates to a different worker. Workers do not communicate with each other — all coordination goes through the supervisor.

When to use it: when the overall task requires ongoing contextual judgment about which sub-tasks to run, in what order, and whether the results are satisfactory. Complex research tasks, multi-step planning workflows, and adaptive content generation are good candidates.

The failure mode here is supervisor overload. If the supervisor is doing too much — coordinating workers, evaluating results, maintaining context, and making decisions about what comes next — its context window fills with coordination overhead and its judgment quality degrades. Keep the supervisor's role as narrow as possible: coordinate and evaluate, do not implement.
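A minimal sketch of a narrow supervisor loop under that constraint: plan, worker, and acceptable are hypothetical stubs for model calls, and the revision cap is illustrative. The supervisor routes, evaluates, and decides; the work itself stays inside the workers.

```python
MAX_REVISIONS = 3

def plan(task: str) -> list[str]:
    """Decompose the task into sub-tasks (a model call in practice)."""
    return [f"{task}: research", f"{task}: draft"]

def worker(subtask: str) -> str:
    """Specialist worker (a model call in practice)."""
    return f"result for {subtask}"

def acceptable(task: str, output: str) -> bool:
    """Supervisor's judgment of a result against the task context."""
    return bool(output)

def supervise(task: str) -> dict[str, str]:
    # The supervisor coordinates and evaluates; it never implements.
    results: dict[str, str] = {}
    for subtask in plan(task):
        output = worker(subtask)
        attempts = 1
        while not acceptable(task, output) and attempts < MAX_REVISIONS:
            output = worker(f"{subtask} (revision {attempts})")
            attempts += 1
        results[subtask] = output
    return results
```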

Pattern 4: State machine orchestration

State machine orchestration is the most complex pattern and the most reliable one for production systems where correctness and auditability are required. The workflow is modelled as a set of explicit states with defined transition rules. An agent can only be in one state at a time. Every transition is triggered by a specific condition and is logged. Human approval is modelled as a state — the system waits in WAITING_FOR_APPROVAL until a human acts.

When to use it: whenever you need auditability, rollback capability, and reliable human-in-the-loop checkpoints. Systems where an error has real consequences — financial transactions, customer-facing communications, infrastructure changes — should use state machines.

The engineering overhead is real. Designing a complete state machine requires you to enumerate every possible state, every possible transition, and every failure path explicitly. This is time-consuming up front. In production, it pays back many times over.
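Here is one way such a state machine can look in Python. The states and events are illustrative; the shape is the point: an explicit transition table, a hard error on anything not in it, and an append-only log that doubles as the audit trail.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    WAITING_FOR_APPROVAL = auto()
    DONE = auto()
    FAILED = auto()

# Every legal transition is enumerated; anything else is a hard error.
TRANSITIONS = {
    (State.IDLE, "start"): State.RUNNING,
    (State.RUNNING, "needs_approval"): State.WAITING_FOR_APPROVAL,
    (State.WAITING_FOR_APPROVAL, "approved"): State.RUNNING,
    (State.WAITING_FOR_APPROVAL, "rejected"): State.FAILED,
    (State.RUNNING, "complete"): State.DONE,
    (State.RUNNING, "error"): State.FAILED,
}

class Workflow:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.log: list[tuple[State, str, State]] = []  # the audit trail

    def fire(self, event: str) -> State:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event} from {self.state}")
        new_state = TRANSITIONS[key]
        self.log.append((self.state, event, new_state))  # every transition is logged
        self.state = new_state
        return new_state
```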

The shared memory problem

In any multi-agent system, agents need to share context. The naive approach — passing the entire conversation history or task context to every agent — fails at scale. Context windows fill up. Costs increase with every agent invocation. Agents start hallucinating outputs from earlier stages that they did not actually produce.

The pattern that works: a structured shared memory store with typed fields, where each agent reads only the fields it needs and writes only the fields it produces. Each agent's read/write interface is defined at design time. Agents cannot write to fields they do not own. The store is the contract between agents.

This requires more design work upfront. The payoff is that shared state stays coherent as the system runs, agents work from accurate context rather than accumulated noise, and the system remains debuggable — you can inspect the store at any point and understand exactly what each agent knew when it made its decisions.
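A minimal sketch of such a store, with hypothetical field names and owners. The ownership table is the design-time contract; the store enforces it at runtime and stays inspectable throughout.

```python
# Illustrative ownership table: which agent may write which field.
FIELD_OWNERS = {
    "raw_document": "ingest_agent",
    "extracted_facts": "extraction_agent",
    "summary": "summary_agent",
}

class SharedStore:
    def __init__(self) -> None:
        self._fields: dict[str, object] = {}

    def read(self, field: str):
        # Any agent may read, but only fields defined in the contract.
        if field not in FIELD_OWNERS:
            raise KeyError(f"unknown field: {field}")
        return self._fields.get(field)

    def write(self, agent: str, field: str, value: object) -> None:
        # Agents may only write to fields they own: the store is the contract.
        if FIELD_OWNERS.get(field) != agent:
            raise PermissionError(f"{agent} does not own {field}")
        self._fields[field] = value

    def snapshot(self) -> dict[str, object]:
        # Inspectable at any point: what did each agent know, and when?
        return dict(self._fields)
```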

Human-in-the-loop: where to put gates

Gates should go where errors are costly, irreversible, or hard to detect automatically. Before any external action that cannot be undone — sending an email, executing a database write, calling an external API with side effects. Before any output that will be seen by a customer or stakeholder. After any stage where the agent is working in territory that is ambiguous or poorly specified.

Gates should not go between every stage. An agent pipeline where every stage requires human approval is not autonomous — it is a complicated form of manual processing. Design gates for the transitions that actually carry risk, not for every transition.

What happens when agents fail is as important as what happens when they succeed. Define failure states explicitly: what the system does when an agent returns an error, when an agent's output fails validation, when a timeout occurs, when a human does not respond to an approval request within a defined window. Systems without explicit failure handling fail unpredictably. Systems with explicit failure handling fail gracefully.
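One way to make those failure states explicit in code, sketched below. The outcome names and handler policies are illustrative, and a real implementation would actually enforce the timeout on the agent call rather than just catching it.

```python
from enum import Enum

class Outcome(Enum):
    OK = "ok"
    AGENT_ERROR = "agent_error"
    VALIDATION_FAILED = "validation_failed"
    TIMED_OUT = "timed_out"
    APPROVAL_EXPIRED = "approval_expired"

def run_stage(agent, payload, validate):
    """Run one stage and map every way it can go wrong to a named outcome."""
    try:
        result = agent(payload)  # in practice: run under an enforced timeout
    except TimeoutError:
        return Outcome.TIMED_OUT, None
    except Exception:
        return Outcome.AGENT_ERROR, None
    if not validate(result):
        return Outcome.VALIDATION_FAILED, None
    return Outcome.OK, result

# Each non-OK outcome has a defined response instead of an unhandled crash.
HANDLERS = {
    Outcome.AGENT_ERROR: "retry once, then escalate to a human",
    Outcome.VALIDATION_FAILED: "request revision from the agent",
    Outcome.TIMED_OUT: "retry with a longer budget, then fail the run",
    Outcome.APPROVAL_EXPIRED: "notify the requester and park the workflow",
}
```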

Lessons from production systems

Building a 13-agent GTM system taught us that orchestration complexity scales faster than agent count. Going from five agents to thirteen did not produce two-and-a-half times the coordination complexity; it produced something closer to ten times, because the number of potential interaction paths grows roughly quadratically (ten pairwise paths among five agents, seventy-eight among thirteen). The response was to introduce strict isolation between agent clusters: agents within a cluster share state, and agents across clusters communicate only through defined interfaces.

Building a 22-agent manufacturing intelligence platform taught us that real-time data changes everything about agent design. Agents that work from static documents are easy to make reliable. Agents that work from live data streams need explicit freshness validation — they need to know when the data they are working from is stale and what to do about it. This adds complexity that is easy to underestimate in the design phase.
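A minimal sketch of that freshness validation; the staleness threshold and fallback behaviour are illustrative, not taken from the platform itself.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # illustrative staleness threshold

def validate_freshness(value: float, read_at: datetime):
    """Return (value, None) if fresh, (None, reason) if stale."""
    age = datetime.now(timezone.utc) - read_at
    if age > MAX_AGE:
        # The agent must know the data is stale and what to do about it:
        # here, refuse to act and flag the reading for a fresh fetch.
        return None, f"stale reading ({age.total_seconds():.0f}s old)"
    return value, None
```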

The lesson from both: the architecture you design before you start building is not the architecture you will have when the system is in production. Design for change. Use patterns that are easy to modify. Prioritise debuggability over cleverness.

What we would do differently today

We would invest more time in state machine design upfront, even for systems where it seems like overkill. We would be more aggressive about defining inter-agent interfaces before writing any agent implementation. And we would build better tooling for inspecting shared memory at runtime: the absence of good observability tooling cost us more debugging time than any agent design decision.
