Most agent orchestration systems break under edge cases. Not because the agents themselves fail — because the coordination layer between them has no explicit model of what the system is doing or what should happen when something goes wrong. The system enters a state it was not designed for, and there is no defined path back to safety.
State machine orchestration solves this by making the coordination model explicit. Every state the system can be in is defined. Every transition between states is triggered by a specific condition. Every failure path is a known state with a defined recovery procedure. Human approval is not an interruption to the workflow — it is a state, with defined entry conditions, timeout behaviour, and exit transitions.
What a state machine is
A state machine is a model that describes a system as a set of states, a set of transitions between those states, and the conditions that trigger each transition. At any point in time, the system is in exactly one state. It transitions to another state when a defined condition is met.
The practical value in agent systems: a state machine forces you to enumerate the states your system can be in and the transitions between them. This enumeration surfaces assumptions you did not know you were making — cases you had not considered, failure paths you had not defined. The work of building the state machine is the work of making your system's behaviour explicit and complete.
The five key states
IDLE. The system is not executing. Waiting for a trigger — a scheduled time, an external event, a human request. The system can receive new task input in this state.
RUNNING. The system is executing its pipeline. Agents are working. State is being updated. The system is progressing toward a terminal state or a gate.
WAITING_FOR_APPROVAL. The system has reached a human review gate. Execution is paused. A human needs to act. The system holds its current state — no agents run, no state mutates — until a human approves or rejects.
FAILED. The system has encountered an error it cannot resolve autonomously. Execution has stopped. The failure is logged with the state at the time of failure, the agent that failed, and the input that caused the failure. Recovery requires human action.
COMPLETE. The system has successfully finished its task and produced its outputs. The final state is recorded with a timestamp. The system returns to IDLE after completing any post-task housekeeping.
Transition rules
Every transition is triggered by a defined condition. IDLE to RUNNING: task input received and validated. RUNNING to WAITING_FOR_APPROVAL: a gate condition is met — a specific agent output requires human review, or the pipeline has reached a defined checkpoint. WAITING_FOR_APPROVAL to RUNNING: a human approves, and execution resumes from the point where it paused. WAITING_FOR_APPROVAL to FAILED: a human rejects, or the approval timeout window expires without action. RUNNING to FAILED: an agent returns an error, output validation fails, or a defined exception condition occurs. FAILED to RUNNING: human investigation has resolved the failure and approved resumption. RUNNING to COMPLETE: all pipeline stages finish successfully and output validation passes.
The key discipline: every possible transition must be defined. A system where RUNNING can only transition to WAITING_FOR_APPROVAL, FAILED, or COMPLETE has no undefined states. A system where RUNNING can transition to anything depending on context-specific logic has unmodelled states — and it will eventually reach one of them.
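These rules can be encoded directly as a lookup table keyed by (state, trigger). The following is a minimal Python sketch, not a prescribed implementation; the trigger names (`task_validated`, `gate_reached`, and so on) are illustrative labels for the conditions described above.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    WAITING_FOR_APPROVAL = auto()
    FAILED = auto()
    COMPLETE = auto()

# Every legal transition, keyed by (current state, trigger).
# Anything absent from this table is an undefined transition and is refused.
TRANSITIONS = {
    (State.IDLE, "task_validated"): State.RUNNING,
    (State.RUNNING, "gate_reached"): State.WAITING_FOR_APPROVAL,
    (State.RUNNING, "agent_error"): State.FAILED,
    (State.RUNNING, "validation_failed"): State.FAILED,
    (State.RUNNING, "pipeline_finished"): State.COMPLETE,
    (State.WAITING_FOR_APPROVAL, "human_approved"): State.RUNNING,
    (State.WAITING_FOR_APPROVAL, "human_rejected"): State.FAILED,
    (State.WAITING_FOR_APPROVAL, "approval_timeout"): State.FAILED,
    (State.FAILED, "resumption_approved"): State.RUNNING,
}

def next_state(current: State, trigger: str) -> State:
    key = (current, trigger)
    if key not in TRANSITIONS:
        # Refusing loudly is the point: no silent drift into unmodelled states.
        raise ValueError(f"undefined transition: {current.name} on '{trigger}'")
    return TRANSITIONS[key]
```

Any (state, trigger) pair outside the table raises immediately, which is the "no unmodelled states" discipline made mechanical.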
Why WAITING_FOR_APPROVAL is the most important state
Most agent orchestration frameworks treat human approval as an interruption — the system pauses somehow, a human does something, the system resumes. The mechanism for pausing, the mechanism for resuming, what happens if the human never acts, and what state the system is in during all of this are usually implementation details rather than first-class design decisions.
Making WAITING_FOR_APPROVAL a named state with defined behaviour changes this. The state has an entry condition: a specific gate criterion in the pipeline. It has defined behaviour while active: no agents run, state is locked, an approval request is dispatched to the designated reviewer. It has a timeout: if no action is taken within a defined window, the system transitions to FAILED with a TIMEOUT_APPROVAL reason code. It has two exit paths: approval (back to RUNNING) and rejection (to FAILED). Every one of these is explicit, tested, and logged.
This matters in production for a non-obvious reason: human reviewers are not always available immediately. Systems that pause indefinitely while waiting for human action create invisible bottlenecks. Systems with explicit timeouts and defined timeout behaviour fail visibly, which is far better than failing invisibly.
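The timeout discipline can be sketched as a poll-style check. The 72-hour window and the function and trigger names here are illustrative assumptions, not fixed identifiers.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed 72-hour approval window; the right value is a per-deployment decision.
APPROVAL_WINDOW = timedelta(hours=72)

def approval_trigger(requested_at: datetime, decision: Optional[str]) -> Optional[str]:
    """Return the transition trigger for a pending approval, or None to keep waiting.

    `decision` is None until the reviewer acts, then "approve" or "reject".
    """
    if decision == "approve":
        return "human_approved"    # exit to RUNNING, resume from the pause point
    if decision == "reject":
        return "human_rejected"    # exit to FAILED
    if datetime.now(timezone.utc) - requested_at > APPROVAL_WINDOW:
        return "approval_timeout"  # exit to FAILED with reason TIMEOUT_APPROVAL
    return None                    # hold: no agents run, state stays locked
```

The value of the explicit `None` branch is that "still waiting" is a deliberate outcome, not the absence of one: the system fails visibly at the window boundary instead of pausing forever.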
The audit log
Every state transition is logged: the state transitioned from, the state transitioned to, the trigger condition, the timestamp, and — for human-triggered transitions — the identity of the human who acted and their decision record.
This audit log is the most valuable artefact of state machine orchestration. When something goes wrong in production, the audit log tells you exactly what the system was doing, what state it was in, what triggered the transition to the failure state, and what the human decisions were that led up to it. You are never investigating a black box — you are reading a timestamped record of every decision the system made.
The audit log is also your compliance record. For regulated industries — finance, healthcare, manufacturing — demonstrating that human review occurred at specific checkpoints, with a specific person, at a specific time, is not just operationally valuable. It is required.
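A minimal sketch of one such transition record, assuming an in-memory list as the log store; a production system would write to append-only, persistent storage. The field and reviewer names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class TransitionRecord:
    from_state: str
    to_state: str
    trigger: str
    timestamp: str
    actor: Optional[str] = None          # populated only for human-triggered transitions
    decision_note: Optional[str] = None  # the human's decision record

audit_log = []

def record_transition(from_state, to_state, trigger, actor=None, decision_note=None):
    entry = TransitionRecord(
        from_state=from_state,
        to_state=to_state,
        trigger=trigger,
        timestamp=datetime.now(timezone.utc).isoformat(),
        actor=actor,
        decision_note=decision_note,
    )
    audit_log.append(entry)
    return entry

# A human-triggered transition: actor and decision are part of the record.
record_transition("WAITING_FOR_APPROVAL", "RUNNING", "human_approved",
                  actor="reviewer@example.com", decision_note="requirements confirmed")
```

Making the record immutable (`frozen=True`) matches the audit-log contract: entries are appended, never edited.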
Rollback: how state machines make recovery predictable
When an agent system fails without explicit state management, recovery is archaeology: you try to reconstruct what the system was doing from logs and code, figure out what changed, and decide what to revert. This is slow and error-prone, especially under pressure.
With state machine orchestration, rollback is a defined procedure. The FAILED state captures a snapshot of the system state at the point of failure. Recovery procedures are defined per failure reason code. A VALIDATION_FAILURE has a different recovery procedure than a TIMEOUT_APPROVAL or an EXTERNAL_API_ERROR. Engineers do not improvise recovery — they execute the defined procedure for the specific failure type.
The system can be rolled back to any previous recorded state, because every state transition was recorded. You can replay from a specific checkpoint, re-run from a specific stage, or resume from the exact point where a failure occurred — once the failure condition has been resolved.
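Recovery dispatch by reason code can be sketched as a lookup from code to procedure. The snapshot shape and the procedure bodies below are illustrative placeholders; only the reason codes come from the text.

```python
# Each defined failure reason code maps to one defined recovery procedure.
def recover_validation_failure(snapshot: dict) -> dict:
    # Re-run the failed stage against the state snapshot captured at failure time.
    return {"action": "rerun_stage", "resume_from": snapshot["state"]}

def recover_timeout_approval(snapshot: dict) -> dict:
    # Re-dispatch the approval request and re-enter the waiting state.
    return {"action": "reopen_approval", "resume_from": snapshot["state"]}

def recover_external_api_error(snapshot: dict) -> dict:
    # Resume from the exact failure point once the dependency is healthy again.
    return {"action": "resume", "resume_from": snapshot["state"]}

RECOVERY_PROCEDURES = {
    "VALIDATION_FAILURE": recover_validation_failure,
    "TIMEOUT_APPROVAL": recover_timeout_approval,
    "EXTERNAL_API_ERROR": recover_external_api_error,
}

def recover(reason_code: str, snapshot: dict) -> dict:
    if reason_code not in RECOVERY_PROCEDURES:
        # An unknown reason code is itself a modelling gap, surfaced loudly.
        raise KeyError(f"no defined recovery procedure for {reason_code}")
    return RECOVERY_PROCEDURES[reason_code](snapshot)
```

Engineers invoke `recover` with the logged reason code and snapshot rather than deciding under pressure what to revert.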
Implementation sketch: a discovery-to-specification pipeline
A pipeline that takes a client project brief and produces a technical specification can be modelled as a 20-state machine: five pipeline stages, each with STAGE_RUNNING, STAGE_COMPLETE, and STAGE_FAILED sub-states, two WAITING_FOR_APPROVAL states at the requirement confirmation and architecture approval gates, plus the top-level IDLE, COMPLETE, and FAILED states.
states:
IDLE
SCOUT_RUNNING → SCOUT_COMPLETE | SCOUT_FAILED
WAITING_REQUIREMENT_APPROVAL
HUNTER_RUNNING → HUNTER_COMPLETE | HUNTER_FAILED
ATLAS_RUNNING → ATLAS_COMPLETE | ATLAS_FAILED
WAITING_ARCHITECTURE_APPROVAL
FORGE_RUNNING → FORGE_COMPLETE | FORGE_FAILED
SENTINEL_RUNNING → SENTINEL_COMPLETE | SENTINEL_FAILED
COMPLETE
FAILED
transitions:
IDLE → SCOUT_RUNNING: task_received
SCOUT_COMPLETE → WAITING_REQUIREMENT_APPROVAL: always
WAITING_REQUIREMENT_APPROVAL → HUNTER_RUNNING: human_approved
WAITING_REQUIREMENT_APPROVAL → FAILED: human_rejected | timeout_72h
HUNTER_COMPLETE → ATLAS_RUNNING: always
ATLAS_COMPLETE → WAITING_ARCHITECTURE_APPROVAL: always
WAITING_ARCHITECTURE_APPROVAL → FORGE_RUNNING: human_approved
WAITING_ARCHITECTURE_APPROVAL → ATLAS_RUNNING: human_requested_revision
WAITING_ARCHITECTURE_APPROVAL → FAILED: human_rejected | timeout_72h
FORGE_COMPLETE → SENTINEL_RUNNING: always
SENTINEL_COMPLETE → COMPLETE: all_checks_passed
SENTINEL_COMPLETE → WAITING_ARCHITECTURE_APPROVAL: critical_findings
any_FAILED → FAILED: propagate
This model makes explicit every state the system can be in, every condition that drives transitions, and every recovery path. Building it surfaces questions that would otherwise surface as production incidents.
What you lose by using event-driven patterns instead
Event-driven orchestration — where agents emit events and other agents react to them — is flexible and composable. It is also significantly harder to reason about, audit, and recover from when things go wrong. The system's state at any point in time is implicit: distributed across event queues, agent internals, and shared stores in ways that are difficult to inspect holistically.
For agent systems where correctness, auditability, and controlled human oversight are requirements, event-driven patterns pay a reliability tax. The flexibility they offer is real, but in most production agent workflows it is flexibility you do not need — and the auditability you lose is something you always need.