Research Streams
Four questions.
Hundreds of experiments.
We publish everything we learn — methods, results, and failures. Four active streams, all open.
01
Active
Agent Architecture
How do you build AI agents that are reliable, observable, and composable in production?
5 Articles
8 Experiments
Key findings so far
- Orchestrator-worker is the only pattern that scales past 5 agents — every other coordination model degrades under load
- Agents lose coherence after ~30 minutes without explicit memory architecture — short-term context alone is insufficient
- Human oversight gates reduced critical failures by 94% in our FabricxAI deployment — position matters as much as presence
- Graceful degradation requires as much design effort as the happy path; agent systems without explicit failure modes degrade catastrophically
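The orchestrator-worker pattern named in the first finding can be sketched in a few lines. This is a minimal illustration, not our production code: `plan` and `worker` are hypothetical stand-ins for the task-decomposition and LLM-worker calls, and the key property shown is that coordination stays central — workers never talk to each other.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    payload: str

def plan(task: str) -> list[Subtask]:
    # The orchestrator decomposes the task; each worker sees only its slice.
    return [Subtask(name=f"part-{i}", payload=chunk)
            for i, chunk in enumerate(task.split(". "))]

def worker(sub: Subtask) -> str:
    # Hypothetical stand-in for an LLM worker call.
    return f"{sub.name}: handled '{sub.payload}'"

def orchestrate(task: str, max_workers: int = 4) -> list[str]:
    # Fan out subtasks, then merge results in order -- the orchestrator
    # is the single point that sees the whole task and the whole answer.
    subtasks = plan(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, subtasks))

results = orchestrate("Summarise the report. Extract action items. Draft a reply")
```

Because all coordination flows through `orchestrate`, adding workers scales the fan-out without adding peer-to-peer coordination paths — which is where, in our experience, other coordination models degrade.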
Multi-agent coordination · Tool use · Memory systems · Failure recovery
Read the research
02
Active
Applied AI
What does AI actually do well — and where does it fail — in real production software?
4 Articles
7 Experiments
Key findings so far
- RAG retrieval quality, not model capability, is the dominant factor in answer accuracy for domain-specific questions
- Structured output extraction fails silently — models return plausible-looking JSON that violates the schema without signalling the error
- Prompt regression is real: a prompt that works well today degrades measurably after a model update, with no notification
- Hallucination mitigation via self-consistency checking adds 40–60% cost but reduces confident wrong answers by over 80%
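The silent-failure finding above is easy to reproduce: the model returns JSON that parses cleanly but violates the schema. A minimal defence is to validate every field explicitly rather than trusting `json.loads` alone. The schema and invoice fields below are illustrative, not from a real deployment:

```python
import json

# Illustrative schema: required field name -> expected Python type
SCHEMA = {"invoice_id": str, "amount": float, "currency": str}

def parse_strict(raw: str, schema: dict) -> dict:
    """Reject plausible-looking JSON that silently violates the schema."""
    obj = json.loads(raw)
    errors = []
    for key, typ in schema.items():
        if key not in obj:
            errors.append(f"missing field: {key}")
        elif not isinstance(obj[key], typ):
            errors.append(f"wrong type for {key}: {type(obj[key]).__name__}")
    extra = sorted(set(obj) - set(schema))
    if extra:
        errors.append(f"unexpected fields: {extra}")
    if errors:
        raise ValueError("; ".join(errors))
    return obj

# Plausible but wrong: amount is a string, currency is missing entirely.
bad = '{"invoice_id": "INV-7", "amount": "129.00"}'
good = '{"invoice_id": "INV-7", "amount": 129.0, "currency": "EUR"}'
```

The point is that the failure only becomes an error when you make it one — without the explicit check, `bad` flows downstream as a perfectly valid Python dict.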
LLM benchmarking · Prompt engineering · RAG pipelines · Hallucination mitigation
Read the research
03
Active
Developer Tooling
How do we make AI-assisted development workflows faster, safer, and more auditable?
4 Articles
6 Experiments
Key findings so far
- AI code review tools are abandoned within 2 weeks unless they surface issues engineers cannot easily catch manually — redundancy kills adoption
- Spec-to-code pipelines fail at the architecture layer: AI generates syntactically correct code built on structurally wrong assumptions
- Test generation from existing code achieves 82% average coverage (up from 45%) but consistently misses security and edge case scenarios
- AI deployment gates that block on uncertainty — rather than just flag — reduce production incidents by 67% with acceptable false-positive rates
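The distinction in the last finding — blocking on uncertainty rather than just flagging it — can be sketched as a gate where low confidence is itself a blocking condition. The threshold values and field names here are illustrative assumptions, not our deployed configuration:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    BLOCK = "block"

@dataclass
class GateResult:
    verdict: Verdict
    reason: str

def deploy_gate(risk_score: float, confidence: float,
                risk_limit: float = 0.3, min_confidence: float = 0.8) -> GateResult:
    """Block, don't just flag: an uncertain assessment is treated the
    same as a high-risk one, because 'unknown' is not 'safe'."""
    if confidence < min_confidence:
        return GateResult(Verdict.BLOCK,
                          f"confidence {confidence:.2f} below {min_confidence}")
    if risk_score > risk_limit:
        return GateResult(Verdict.BLOCK,
                          f"risk {risk_score:.2f} exceeds {risk_limit}")
    return GateResult(Verdict.PASS, "within thresholds")
```

A flag-only gate would return PASS with a warning in the low-confidence branch; making it BLOCK is the single design choice the finding is about.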
AI code review · Testing automation · Spec-to-code pipelines · CI/CD with AI gates
Read the research
04
Active
Industry Automation
Which vertical-specific business processes are genuinely automatable today, and which are not?
3 Articles
5 Experiments
Key findings so far
- Document extraction accuracy degrades sharply on non-standard layouts — 94% on clean forms, 61% on free-form documents with the same model
- Customer communication automation works for high-volume, low-stakes interactions but requires mandatory human escalation paths; without them, the 15% error rate causes outsized damage
- Compliance automation in regulated industries requires explainability by default — "the AI decided" is not an acceptable audit trail in fintech or legal contexts
- Workflow orchestration bottlenecks are almost never technical — they are data quality and change management problems that AI amplifies rather than solves
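The mandatory-escalation finding reduces to a routing rule: automation only handles the safe remainder, and everything high-stakes or low-confidence goes to a human. The stakes labels and threshold below are illustrative assumptions:

```python
def route(stakes: str, confidence: float,
          escalation_threshold: float = 0.85) -> str:
    """Mandatory escalation path: anything that is not low-stakes,
    or that the model is unsure about, goes to a human."""
    if stakes != "low":
        return "human"
    if confidence < escalation_threshold:
        return "human"
    return "auto"
```

Note the asymmetry: both conditions route to "human" and only their joint absence routes to "auto". That is what makes the escalation path mandatory rather than best-effort — there is no input for which the automation handles a high-stakes interaction.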
Document processing · Customer communication · Data extraction · Workflow orchestration
Read the research
Latest results
Recent experiment results
Multi-model agent orchestration (routing by task complexity)
40% cost reduction with no measurable quality loss. Now standard in all deployments.
Confirmed
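Routing by task complexity can be sketched as a cheap classifier in front of the model call. The heuristic below (length, structure, keyword cues) and the model names are illustrative assumptions — a production router might use a trained classifier instead:

```python
def pick_model(prompt: str) -> str:
    """Route simple prompts to a small model, hard ones to a large one.
    The scoring rules here are a crude illustrative heuristic."""
    score = 0
    if len(prompt) > 400:          # long prompts tend to be harder
        score += 1
    if prompt.count("\n") > 5:     # multi-part or structured requests
        score += 1
    if any(kw in prompt.lower() for kw in ("refactor", "prove", "debug", "plan")):
        score += 2                 # task verbs that imply multi-step work
    return "large-model" if score >= 2 else "small-model"
```

The cost saving comes from the base rate: if most traffic is simple, most calls hit the small model, and the router itself costs almost nothing.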
Autonomous code review with no human reviewer
Caught 60% of issues; missed every security vulnerability. Human gates are non-negotiable.
Disproven
RAG pipeline with semantic query routing pre-retrieval
Answer relevance improved 38%. Query routing is now a first-class architectural concern.
Confirmed
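Pre-retrieval query routing means picking the right collection before embedding and searching. A minimal sketch, with keyword scoring standing in for the semantic classifier and the collection names invented for illustration:

```python
# Illustrative collections and their cue words; a semantic router would
# replace this table with a classifier or embedding similarity.
ROUTES = {
    "billing": ("invoice", "refund", "charge", "payment"),
    "technical": ("error", "crash", "install", "timeout"),
}

def route_query(query: str, default: str = "general") -> str:
    """Pick a retrieval collection before running retrieval at all,
    so the search space already matches the question's domain."""
    q = query.lower()
    scores = {name: sum(kw in q for kw in kws) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)  # type: ignore[arg-type]
    return best if scores[best] > 0 else default
```

Treating routing as a first-class concern means this step gets its own evaluation and failure handling, rather than being folded into retrieval.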
All experiments, including the failures.
The experiments log documents every hypothesis we tested — what we expected, what happened, and what we changed because of it.
Browse all experiments