
The manufacturing intelligence platform we built over seven months is the most complex system we have shipped. Twenty-two specialist agents across four operational clusters, processing live factory floor data, coordinating with supply chain systems, and surfacing quality and production insights to operations teams who previously got this information in weekly spreadsheet reports — if they got it at all.

This is the technical story of building it. The architecture decisions, the things that broke, and the parts we would change if we started today.

The problem: garment manufacturing runs on spreadsheets and WhatsApp

The client operates multiple garment manufacturing facilities. Their operational intelligence — production throughput, quality rejection rates, machine downtime, supplier lead times, inventory levels — lived in a combination of Excel files maintained by floor supervisors, WhatsApp groups where managers shared updates, and a legacy ERP system that was accurate for financial records but weeks behind for operational data.

The gap between what was happening on the floor and what management could see was measured in days. By the time a quality issue was visible in a report, it had already affected hundreds of units. By the time a production bottleneck showed up in the data, the orders affected by it were already late.

The goal was not to replace the human judgment of operations managers. It was to give them accurate, timely information and flag the situations that required their attention — so they could spend their time making decisions rather than compiling data.

Why a single-agent approach would not work

We considered a single orchestrator model that would pull all operational data and answer queries. The problem was context volume. A manufacturing operation with five facilities, hundreds of product lines, and thousands of daily production events produces more data than fits in any model's context window. A single agent asked to reason across all of it would either truncate critical data or produce unreliable answers from an overloaded context.

The second problem was latency. Some queries need to be answered in seconds — a quality alert on a production line cannot wait for an agent to process the full operational context. Other queries can tolerate minutes — a weekly trend analysis does not need real-time response. A single agent architecture cannot optimise for both simultaneously.

The cluster architecture solved both problems: agents within a cluster work with bounded, domain-specific context, and each cluster can be optimised for its latency and throughput requirements independently.

The four-cluster architecture

Quality Control cluster (6 agents). INSPECTOR monitors incoming material quality data from supplier shipments and flags deviations from specification. AUDITOR processes end-of-line quality checks and identifies rejection patterns. TRACER performs root cause analysis on quality failures, linking rejection events to upstream causes. CALIBRATOR monitors machine calibration data and flags drift before it causes rejection events. REPORTER aggregates quality data into shift and daily summaries. ESCALATOR identifies quality events that require supervisor intervention and routes alerts.

Production Tracking cluster (6 agents). THROUGHPUT monitors actual production rates against targets in real time. BOTTLENECK identifies which stations are constraining total throughput. SCHEDULER monitors order progress against committed delivery dates and flags at-risk orders. DOWNTIME processes machine downtime events and categorises causes. EFFICIENCY calculates output per hour across facilities and production lines. FORECAST projects end-of-shift and end-of-day production totals based on current throughput.

Supply Chain cluster (5 agents). INVENTORY monitors raw material stock levels and generates reorder signals. SUPPLIER tracks inbound shipment status and flags delays. LEAD_TIME calculates and maintains running lead time estimates for each supplier and material category. CONSUMPTION projects material consumption based on production schedules and triggers procurement actions. VARIANCE identifies deviations between planned and actual material usage.

Analytics cluster (5 agents). TREND analyses production and quality data over rolling time windows to identify patterns not visible in real-time monitoring. BENCHMARK compares performance across facilities and shifts. ANOMALY identifies statistical outliers in any operational metric. INSIGHT synthesises data across clusters to generate management-level summaries. REPORT assembles scheduled reports from cluster outputs.
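
In code, the four rosters above reduce to a small registry. A sketch of that structure, purely illustrative (the dictionary layout is an assumption, not the platform's actual configuration format):

```python
# Illustrative registry of the four clusters and their agents.
# The layout is an assumption, not the platform's actual config format.
CLUSTERS = {
    "quality_control": [
        "INSPECTOR", "AUDITOR", "TRACER", "CALIBRATOR", "REPORTER", "ESCALATOR",
    ],
    "production_tracking": [
        "THROUGHPUT", "BOTTLENECK", "SCHEDULER", "DOWNTIME", "EFFICIENCY", "FORECAST",
    ],
    "supply_chain": [
        "INVENTORY", "SUPPLIER", "LEAD_TIME", "CONSUMPTION", "VARIANCE",
    ],
    "analytics": [
        "TREND", "BENCHMARK", "ANOMALY", "INSIGHT", "REPORT",
    ],
}

# 6 + 6 + 5 + 5 agents across four clusters.
assert sum(len(agents) for agents in CLUSTERS.values()) == 22
```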

How agents within a cluster share state

Each cluster has a shared state store: a structured object with typed fields, a version number, and a timestamp. Agents within the cluster read from and write to specific fields they own. An agent can read any field but can only write to fields assigned to it at design time.
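
A minimal sketch of that ownership rule, assuming a simple in-memory store (the class and field names are illustrative, not the platform's actual types):

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

class WritePermissionError(Exception):
    """Raised when an agent writes to a field it does not own."""

@dataclass
class ClusterStateStore:
    ownership: dict[str, str]                             # field name -> owning agent, fixed at design time
    fields: dict[str, Any] = field(default_factory=dict)
    version: int = 0
    updated_at: datetime | None = None

    def read(self, name: str) -> Any:
        # Any agent in the cluster may read any field.
        return self.fields.get(name)

    def write(self, agent: str, name: str, value: Any) -> None:
        # Writes are restricted to the field's design-time owner.
        if self.ownership.get(name) != agent:
            raise WritePermissionError(f"{agent} does not own field {name!r}")
        self.fields[name] = value
        self.version += 1
        self.updated_at = datetime.now(timezone.utc)

# Example: THROUGHPUT owns throughput_rate; BOTTLENECK may read it but not write it.
store = ClusterStateStore(ownership={"throughput_rate": "THROUGHPUT"})
store.write("THROUGHPUT", "throughput_rate", 412.0)
```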

The cluster state store is updated on a defined tick — every 60 seconds for the real-time clusters, every 10 minutes for the Analytics cluster. Agents that run on the same tick operate on a consistent snapshot of the previous tick's state. This prevents the race conditions that occur when agents run asynchronously against a shared mutable store.
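
In outline, a tick looks something like this, reusing the illustrative store from the previous sketch (the scheduling details are assumptions):

```python
import copy

def run_cluster_tick(store: ClusterStateStore, agents: list) -> None:
    """Execute one tick for a cluster.

    Every agent reads from a frozen snapshot of the previous tick's state and
    writes its results to the live store, which becomes the next tick's snapshot.
    """
    snapshot = copy.deepcopy(store)          # consistent read view for this tick
    for agent in agents:
        agent.run(read_state=snapshot, write_state=store)
```

A scheduler would call this every 60 seconds for the real-time clusters and every 10 minutes for the Analytics cluster.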

Cross-cluster communication goes through defined interfaces: structured messages that one cluster publishes and another subscribes to. The Quality Control cluster publishes quality event messages that the Analytics cluster subscribes to. The Production Tracking cluster publishes throughput data that the Supply Chain cluster uses for consumption projections. Clusters do not share state directly — they communicate through typed message channels.
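
A sketch of what a typed channel can look like, using the quality event flow as the example (the message fields and class names are assumptions, not the platform's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class QualityEvent:
    # Hypothetical fields for a quality event message.
    facility: str
    line: str
    defect_code: str
    rejected_units: int
    occurred_at: datetime

class Channel:
    """A typed publish/subscribe channel between clusters."""

    def __init__(self, message_type: type):
        self.message_type = message_type
        self.subscribers = []

    def subscribe(self, handler) -> None:
        self.subscribers.append(handler)

    def publish(self, message) -> None:
        # Reject anything that is not the channel's declared message type.
        if not isinstance(message, self.message_type):
            raise TypeError(f"expected {self.message_type.__name__}")
        for handler in self.subscribers:
            handler(message)

# Quality Control publishes quality events; the Analytics cluster subscribes.
quality_events = Channel(QualityEvent)
```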

The real-time data challenge

Agents working from static documents have a simple relationship with data: the document is what it is, and the agent works from it. Agents working from live factory floor data have a more complex problem: the data changes constantly, and the agent needs to know whether the data it is working from is current.

We introduced a freshness validation step at the start of every agent execution. Each agent checks the timestamp of the data fields it will use and compares them against configured freshness thresholds. An agent that needs data from the last 60 seconds will abort with a STALE_DATA error rather than produce analysis based on 10-minute-old readings. The system falls back to cached last-known-good values for certain fields while flagging the staleness to the monitoring layer.
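
A simplified version of that check, assuming per-field timestamps and per-field thresholds in seconds (names and structure are illustrative):

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

class StaleDataError(Exception):
    """Raised when a required field is older than its configured freshness threshold."""

def check_freshness(field_timestamps: dict, thresholds_seconds: dict, now: datetime | None = None) -> None:
    """Abort if any required field is staler than its threshold allows.

    field_timestamps:   {field_name: datetime of the last update}
    thresholds_seconds: {field_name: maximum acceptable age in seconds}
    """
    now = now or datetime.now(timezone.utc)
    stale = [
        name for name, ts in field_timestamps.items()
        if now - ts > timedelta(seconds=thresholds_seconds[name])
    ]
    if stale:
        # Better to abort than to reason over outdated readings; the caller can
        # fall back to cached last-known-good values where that is permitted.
        raise StaleDataError(f"stale fields: {stale}")
```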

This validation caught a class of errors we had not anticipated: sensor dropouts that produced a stream of identical readings rather than nulls. An agent without freshness validation would process these as valid data. With timestamp validation, the repeated values were correctly identified as stale.

Approval gates: where human review is mandatory

We defined four categories of action that require human approval regardless of agent confidence: procurement orders above a defined threshold, quality hold decisions that stop production lines, supplier communications, and any alert that goes to the client's executive team. Everything else — internal alerts, data aggregation, status summaries — runs without a gate.

The gates are implemented as state transitions in the cluster state machine. An agent that generates a gate-required action transitions to WAITING_FOR_APPROVAL and logs the action, its confidence score, and the supporting data that led to it. The human reviewer sees this information in the approval interface and can approve, reject, or modify. The decision is logged with a timestamp and the reviewer's identity.
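
A sketch of that gate record and its transition, with field names that are assumptions rather than the platform's actual schema:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class GateStatus(Enum):
    WAITING_FOR_APPROVAL = "waiting_for_approval"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class GatedAction:
    agent: str
    action: str                      # e.g. a procurement order above the threshold
    confidence: float
    supporting_data: dict
    status: GateStatus = GateStatus.WAITING_FOR_APPROVAL
    reviewer: str | None = None
    decided_at: datetime | None = None

    def decide(self, reviewer: str, approved: bool, modified_action: str | None = None) -> None:
        # The reviewer can approve, reject, or modify the proposed action.
        # The decision is logged with a timestamp and the reviewer's identity.
        if modified_action is not None:
            self.action = modified_action
        self.status = GateStatus.APPROVED if approved else GateStatus.REJECTED
        self.reviewer = reviewer
        self.decided_at = datetime.now(timezone.utc)
```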

What broke first

The first production failure came from the BOTTLENECK agent's dependency on THROUGHPUT data. A sensor connectivity issue left the throughput readings staler than BOTTLENECK's freshness threshold allowed, triggering the STALE_DATA path. BOTTLENECK correctly aborted — but the failure state it published to the cluster store caused SCHEDULER to treat all orders as at-risk, generating a cascade of unnecessary alerts to the operations team.

The fix was to distinguish between agent failures caused by upstream data issues and agent failures caused by the agent itself, and to run downstream dependent agents in a degraded mode when the failure is an upstream data issue: they still execute, but with a degraded-mode flag, rather than having all their outputs treated as failures.
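
In outline, the classification looks something like this (the handling of agent-level failures is an assumption; the write-up only describes the upstream-data case):

```python
from enum import Enum

class FailureCause(Enum):
    UPSTREAM_DATA = "upstream_data"   # a dependency hit STALE_DATA or similar
    AGENT_ERROR = "agent_error"       # the agent's own execution failed

def downstream_run_plan(cause: FailureCause) -> dict:
    """Decide how dependent agents run after a failure upstream of them."""
    if cause is FailureCause.UPSTREAM_DATA:
        # Dependents still execute, but their outputs carry a degraded-mode flag
        # instead of being treated as failures (the SCHEDULER alert-cascade fix).
        return {"run": True, "degraded_mode": True}
    # Handling of agent-level failures downstream is not described in the write-up;
    # failing hard here is an assumption made for this sketch.
    return {"run": False, "degraded_mode": False}
```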

The second failure was the INSIGHT agent's context filling up during peak production hours when all clusters were publishing simultaneously. We had not anticipated that the volume of cross-cluster messages could saturate the INSIGHT agent's context window. The fix was to implement a summarisation layer that compresses cluster updates before they reach INSIGHT, preserving the signal while discarding the volume.
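
A deliberately simple stand-in for that layer, grouping raw messages into per-category digests before they reach INSIGHT (the message shape and function name are assumptions):

```python
from collections import defaultdict

def compress_cluster_updates(messages: list, max_examples: int = 3) -> list:
    """Collapse a burst of cross-cluster messages into per-category digests.

    Instead of forwarding every raw message to INSIGHT, group messages by
    (cluster, category), keep a count and a few representative examples, and
    pass on the much smaller digest.
    """
    grouped = defaultdict(list)
    for msg in messages:
        grouped[(msg["cluster"], msg["category"])].append(msg)

    return [
        {
            "cluster": cluster,
            "category": category,
            "count": len(items),
            "examples": items[:max_examples],
        }
        for (cluster, category), items in grouped.items()
    ]
```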

Performance outcomes

Visibility of quality rejection rates for operations managers went from daily to real-time. The average time from a quality event to operations team awareness dropped from 4–6 hours to under 8 minutes. End-of-shift production forecasting accuracy improved significantly over the baseline of manual estimates. Material waste from over-ordering based on manual estimates decreased.

The metric we are still measuring is the accuracy of the BOTTLENECK agent's constraint identification. The agent is correct in identifying the primary constraint in the majority of cases, but the cases where it misidentifies the constraint are systematically correlated with a specific category of multi-cause bottleneck that the skill document does not yet model well. We are in the process of updating the skill document based on the production data.
