Five years of building production AI systems taught us what works, what breaks under load, and what looks good in demos but fails at 3am. This documents the decisions we landed on and why.
These are not aspirations. They are constraints we impose on every system we build. Violating them is how you end up with a system that works in staging and breaks in production.
A system that returns the right answer in 400ms is better than one that returns a maybe in 80ms. We optimise for correctness first, then throughput. Latency is a product constraint; wrong answers are a product failure.
Every LLM call is logged with its full input, output, cost, latency, and model version before it reaches production. You cannot debug a system you cannot see. We instrument before we optimise.
Silent degradation is the worst failure mode in AI systems. A hallucinated answer that looks correct is more dangerous than a visible error. Our systems raise explicit exceptions on validation failures — they do not patch over them.
Autonomous agents are powerful in bounded task spaces. At any decision point that crosses a trust boundary — shipping code, modifying production data, sending external communication — a human approves. Always.
Large language models are not good at multi-step reasoning in a single prompt. We break complex tasks into small, testable, composable units — each with a clear input schema, a clear output schema, and a clear success condition.
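In code, such a unit is just a typed contract. The sketch below is illustrative TypeScript: plain predicate functions stand in for the Zod schemas used in production, the `summariseTicket` unit is a made-up example, and the LLM call is stubbed with a string truncation.

```typescript
// A composable task unit: typed input, typed output, explicit success check.
interface TaskUnit<I, O> {
  name: string;
  validateInput: (raw: unknown) => raw is I;   // input schema
  run: (input: I) => O;                        // one bounded step (async LLM call in production)
  validateOutput: (raw: unknown) => raw is O;  // output schema
  succeeded: (output: O) => boolean;           // explicit success condition
}

// Hypothetical unit: turn a ticket body into a short title.
const summariseTicket: TaskUnit<{ body: string }, { title: string }> = {
  name: "summarise-ticket",
  validateInput: (r: unknown): r is { body: string } =>
    typeof r === "object" && r !== null && typeof (r as { body?: unknown }).body === "string",
  run: (input) => ({ title: input.body.slice(0, 60) }), // stub for the LLM call
  validateOutput: (r: unknown): r is { title: string } =>
    typeof r === "object" && r !== null && typeof (r as { title?: unknown }).title === "string",
  succeeded: (o) => o.title.length > 0,
};

// Running a unit validates on the way in and on the way out; anything
// invalid raises immediately instead of flowing downstream.
function runUnit<I, O>(unit: TaskUnit<I, O>, raw: unknown): O {
  if (!unit.validateInput(raw)) throw new Error(`${unit.name}: invalid input`);
  const out = unit.run(raw);
  if (!unit.validateOutput(out) || !unit.succeeded(out))
    throw new Error(`${unit.name}: invalid output`);
  return out;
}
```

Units compose because one unit's output schema is the next unit's input schema.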
A conceptual map of the system layers we use in production AI pipelines. Every real system adapts this to its specific requirements, but the layer structure stays consistent.
┌─────────────────────────────────────────────────────────────────────┐
│                           EXTERNAL INPUTS                           │
│          User request │ Webhook │ Scheduled job │ API call          │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       INPUT VALIDATION LAYER                        │
│    Schema validation │ Auth check │ Rate limiting │ Sanitisation    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        TASK ROUTER / PLANNER                        │
│                                                                     │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐ │
│  │ Task queue   │     │ Task type    │     │ Priority / budget    │ │
│  │ (BullMQ)     │  →  │ classifier   │  →  │ enforcement          │ │
│  └──────────────┘     └──────────────┘     └──────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                  ┌────────────┼────────────┐
                  ▼            ▼            ▼
             ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
             │ Agent A  │ │ Agent B  │ │ Agent C  │ │ Agent N  │
             │ (Spec)   │ │ (Code)   │ │ (Review) │ │ (Deploy) │
             └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
                  │            │            │            │
                  └────────────┼────────────┴────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        TOOL EXECUTION LAYER                         │
│                                                                     │
│    ┌─────────────┐  ┌────────────┐  ┌──────────┐  ┌────────────┐    │
│    │ Code tools  │  │ DB tools   │  │ File I/O │  │ Web tools  │    │
│    │ (run, lint) │  │ (query)    │  │ (read,   │  │ (search,   │    │
│    └─────────────┘  └────────────┘  │  write)  │  │  fetch)    │    │
│                                     └──────────┘  └────────────┘    │
│                                                                     │
│         All tool calls: retried 3× with exponential backoff         │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            MEMORY LAYER                             │
│                                                                     │
│    Working memory    │    Episodic store    │    Semantic index     │
│    (in-context)      │  (Redis/Postgres)    │    (pgvector)         │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       OUTPUT VALIDATION LAYER                       │
│                                                                     │
│    Zod schema parse → Pass → next stage                             │
│                     → Fail → retry with error injected (max 3×)     │
│                     → Fail after 3× → human escalation queue        │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
           ┌───────────────────┴───────────────┐
           ▼                                   ▼
┌──────────────────────┐   ┌───────────────────────────────────────┐
│                      │   │                                       │
│  HUMAN REVIEW GATE   │   │          OBSERVABILITY LAYER          │
│                      │   │                                       │
│ Decision boundary:   │   │ Every LLM call logged:                │
│ • Ship to prod       │   │ • Input/output (full)                 │
│ • Modify prod DB     │   │ • Cost + latency                      │
│ • External comms     │   │ • Model + temperature                 │
│ • Auth changes       │   │ • Validation result                   │
│                      │   │ • Agent + task context                │
│ Human approves.      │   │                                       │
│ Always.              │   │ Queryable. Alertable. Auditable.      │
└──────────────────────┘   └───────────────────────────────────────┘
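The tool layer's retry policy (three attempts, exponential backoff) is small enough to show in full. This is an illustrative TypeScript sketch rather than our production code, and the 500 ms base delay is an assumed default.

```typescript
// Retry a tool call up to `attempts` times, doubling the delay each time.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500, // assumed default; real value is configuration
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        const delay = baseDelayMs * 2 ** attempt; // 500, 1000, 2000 ms
        await new Promise((res) => setTimeout(res, delay));
      }
    }
  }
  throw lastError; // fail loudly after the final attempt
}
```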
Not recommendations. These are the specific technology choices we made after testing alternatives in production, with the reasoning behind each.
LLMs excel at natural language understanding, code generation, and ambiguous instruction following. They are not calculators — we use them for what they are good at: interpreting intent and generating structured outputs from fuzzy inputs.
LLMs have finite context windows. For systems that need to recall information across long sessions or large knowledge bases, we embed information into vectors and retrieve by semantic similarity. We do not stuff context windows.
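In production the lookup is a pgvector query (`ORDER BY embedding <=> $1 LIMIT k` for cosine distance). The in-memory sketch below, with made-up documents and toy two-dimensional embeddings, shows the operation that query performs.

```typescript
type Doc = { id: string; text: string; embedding: number[] };

// Cosine similarity: dot product normalised by vector magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k documents most similar to the query embedding.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
    )
    .slice(0, k);
}
```

Retrieval then replaces context-window stuffing: only the top-k hits go into the prompt.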
Multi-agent pipelines need durable, retryable task distribution. We use message queues rather than direct agent-to-agent calls so that failed tasks can be retried, work can be distributed across workers, and the system degrades gracefully under load.
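With BullMQ the retry semantics come from job options (`queue.add(name, data, { attempts: 3 })`) plus a `Worker`. The dependency-free stand-in below is only a sketch of why that shape matters: a failed job goes back on the queue instead of taking its caller down, and jobs that exhaust their attempts surface explicitly rather than vanish.

```typescript
// In-memory stand-in for a durable task queue with retry-on-failure.
type Job<T> = { data: T; attemptsLeft: number };

class TaskQueue<T> {
  private jobs: Job<T>[] = [];
  readonly failed: T[] = []; // exhausted jobs land here for escalation

  add(data: T, attempts = 3): void {
    this.jobs.push({ data, attemptsLeft: attempts });
  }

  // Drain the queue with a worker. Jobs that throw are re-enqueued until
  // their attempts run out, then moved to `failed`.
  async process(worker: (data: T) => Promise<void>): Promise<void> {
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      try {
        await worker(job.data);
      } catch {
        job.attemptsLeft -= 1;
        if (job.attemptsLeft > 0) this.jobs.push(job);
        else this.failed.push(job.data);
      }
    }
  }
}
```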
Every agent output is validated against a typed schema before it enters the next pipeline stage. Invalid outputs trigger automatic retry with the validation error appended to the prompt. This eliminates an entire class of downstream failures.
Debugging AI systems without logs is guessing. We log every request and response — including cost, latency, model version, and validation outcome — in a structured format queryable by agent, task type, and failure mode.
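The log record and the wrapper that fills it in look roughly like this. Field names are illustrative, and the sink is whatever your log pipeline is; the point is that the record is written for every call, before anything else happens with the output.

```typescript
interface LlmCallLog {
  model: string;
  input: string;
  output: string;
  latencyMs: number;
  costUsd: number;
  validationPassed: boolean;
  timestamp: string;
}

// Wrap an LLM call so that cost, latency, model, and validation outcome
// are recorded on every invocation.
async function loggedCall(
  model: string,
  input: string,
  call: (input: string) => Promise<{ output: string; costUsd: number }>,
  validate: (output: string) => boolean,
  sink: (log: LlmCallLog) => void, // stdout, file, or log pipeline
): Promise<string> {
  const start = Date.now();
  const { output, costUsd } = await call(input);
  sink({
    model,
    input,
    output,
    latencyMs: Date.now() - start,
    costUsd,
    validationPassed: validate(output),
    timestamp: new Date().toISOString(),
  });
  return output;
}
```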
AI-generated code does not ship automatically. It enters a review queue where an engineer reads, tests, and approves it. The pipeline automates the generation and testing; the human approves the merge. Speed without trust is a liability.
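The gate itself reduces to one predicate. The names below are illustrative, but the shape is the point: passing tests is necessary, never sufficient.

```typescript
interface GeneratedChange {
  id: string;
  diff: string;
  testsPassed: boolean;
  humanApproved: boolean;
}

// A change merges only when both the automated checks and a human say so.
function canMerge(change: GeneratedChange): boolean {
  return change.testsPassed && change.humanApproved;
}
```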
Every one of these is a mistake we either made ourselves or watched take down a system we were brought in to fix. Documenting failures is more useful than documenting successes.
Giving a single agent responsibility for a long, multi-step workflow with external side effects — sending emails, modifying databases, deploying code — without checkpoints. When the agent misunderstands step 2, you find out at step 9 with irreversible consequences.
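The remedy we use is to checkpoint after every side-effecting step: run, verify a cheap invariant, and halt at the first failure instead of compounding it. An illustrative sketch, with hypothetical step names:

```typescript
type Step = {
  name: string;
  run: () => Promise<void>;
  verify: () => Promise<boolean>; // cheap invariant check after the step
};

// Execute steps in order; stop and escalate at the first failed checkpoint,
// reporting exactly how far the workflow got.
async function runWithCheckpoints(steps: Step[]): Promise<string[]> {
  const completed: string[] = [];
  for (const step of steps) {
    await step.run();
    if (!(await step.verify())) {
      throw new Error(
        `checkpoint failed after "${step.name}"; completed: ${completed.join(", ")}`,
      );
    }
    completed.push(step.name);
  }
  return completed;
}
```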
Multiple agents writing to the same data structure without coordination creates race conditions and conflicting interpretations. We've seen agents overwrite each other's outputs, producing results that reflect neither agent's reasoning correctly.
Skipping logging "to keep things simple" during development means the first production failure is undebuggable. Retrofitting observability into a running AI system is far harder than adding it from the start.
Pipelines that generate code and auto-merge PRs without human review. We ran this experiment. AI-generated code that passes tests is still AI-generated code — it can be subtly wrong in ways tests don't catch.
Telling the LLM to "always return valid JSON" and hoping for the best. Under unusual or adversarial inputs, LLMs eventually ignore formatting instructions. Validation must live in code, not in prompts.