Research Streams
Four questions.
Hundreds of experiments.
We publish everything we learn — methods, results, and failures. Four active streams, all open.
01
Active
Agent Architecture
How do you build AI agents that are reliable, observable, and composable in production?
5 Articles
8 Experiments
Key findings so far
- Orchestrator-worker is the only pattern that scales past 5 agents — every other coordination model degrades under load
- Agents lose coherence after ~30 minutes without explicit memory architecture — short-term context alone is insufficient
- Human oversight gates reduced critical failures by 94% in our FabricxAI deployment — position matters as much as presence
- Graceful degradation requires as much design effort as the happy path; agent systems without explicit failure modes degrade catastrophically
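The orchestrator-worker pattern named in the first finding can be sketched in a few lines. This is a minimal illustration, not our production code: `plan` and `worker` are hypothetical stand-ins for the task-decomposition and LLM-worker calls, and the key property shown is that coordination stays central — workers never talk to each other.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    payload: str

def plan(task: str) -> list[Subtask]:
    # The orchestrator decomposes the task; each worker sees only its slice.
    return [Subtask(name=f"part-{i}", payload=chunk)
            for i, chunk in enumerate(task.split(". "))]

def worker(sub: Subtask) -> str:
    # Hypothetical stand-in for an LLM worker call.
    return f"{sub.name}: handled '{sub.payload}'"

def orchestrate(task: str, max_workers: int = 4) -> list[str]:
    # Fan out subtasks, then merge results in order -- the orchestrator
    # is the single point that sees the whole task and the whole answer.
    subtasks = plan(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, subtasks))

results = orchestrate("Summarise the report. Extract action items. Draft a reply")
```

Because all coordination flows through `orchestrate`, adding workers scales the fan-out without adding peer-to-peer coordination paths — which is where, in our experience, other coordination models degrade.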
Multi-agent coordination · Tool use · Memory systems · Failure recovery
Read the research
02
Active
Applied AI
What does AI actually do well — and where does it fail — in real production software?
4 Articles
7 Experiments
Key findings so far
- RAG retrieval quality, not model capability, is the dominant factor in answer accuracy for domain-specific questions
- Structured output extraction fails silently — models return plausible-looking JSON that violates the schema without signalling the error
- Prompt regression is real: a prompt that works well today degrades measurably after a model update, with no notification
- Hallucination mitigation via self-consistency checking adds 40–60% cost but reduces confident wrong answers by over 80%
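The silent-failure finding above is easy to reproduce: the model returns JSON that parses cleanly but violates the schema. A minimal defence is to validate every field explicitly rather than trusting `json.loads` alone. The schema and invoice fields below are illustrative, not from a real deployment:

```python
import json

# Illustrative schema: required field name -> expected Python type
SCHEMA = {"invoice_id": str, "amount": float, "currency": str}

def parse_strict(raw: str, schema: dict) -> dict:
    """Reject plausible-looking JSON that silently violates the schema."""
    obj = json.loads(raw)
    errors = []
    for key, typ in schema.items():
        if key not in obj:
            errors.append(f"missing field: {key}")
        elif not isinstance(obj[key], typ):
            errors.append(f"wrong type for {key}: {type(obj[key]).__name__}")
    extra = sorted(set(obj) - set(schema))
    if extra:
        errors.append(f"unexpected fields: {extra}")
    if errors:
        raise ValueError("; ".join(errors))
    return obj

# Plausible but wrong: amount is a string, currency is missing entirely.
bad = '{"invoice_id": "INV-7", "amount": "129.00"}'
good = '{"invoice_id": "INV-7", "amount": 129.0, "currency": "EUR"}'
```

The point is that the failure only becomes an error when you make it one — without the explicit check, `bad` flows downstream as a perfectly valid Python dict.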
LLM benchmarking · Prompt engineering · RAG pipelines · Hallucination mitigation
Read the research
03
Active
Developer Tooling
How do we make AI-assisted development workflows faster, safer, and more auditable?
4 Articles
6 Experiments
Key findings so far
- AI code review tools are abandoned within 2 weeks unless they surface issues engineers cannot easily catch manually — redundancy kills adoption
- Spec-to-code pipelines fail at the architecture layer: AI generates syntactically correct code built on structurally wrong assumptions
- Test generation from existing code achieves 82% average coverage (up from 45%) but consistently misses security and edge case scenarios
- AI deployment gates that block on uncertainty — rather than just flag — reduce production incidents by 67% with acceptable false-positive rates
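The distinction in the last finding — blocking on uncertainty rather than just flagging it — can be sketched as a gate where low confidence is itself a blocking condition. The threshold values and field names here are illustrative assumptions, not our deployed configuration:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    BLOCK = "block"

@dataclass
class GateResult:
    verdict: Verdict
    reason: str

def deploy_gate(risk_score: float, confidence: float,
                risk_limit: float = 0.3, min_confidence: float = 0.8) -> GateResult:
    """Block, don't just flag: an uncertain assessment is treated the
    same as a high-risk one, because 'unknown' is not 'safe'."""
    if confidence < min_confidence:
        return GateResult(Verdict.BLOCK,
                          f"confidence {confidence:.2f} below {min_confidence}")
    if risk_score > risk_limit:
        return GateResult(Verdict.BLOCK,
                          f"risk {risk_score:.2f} exceeds {risk_limit}")
    return GateResult(Verdict.PASS, "within thresholds")
```

A flag-only gate would return PASS with a warning in the low-confidence branch; making it BLOCK is the single design choice the finding is about.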
AI code review · Testing automation · Spec-to-code pipelines · CI/CD with AI gates
Read the research
04
Active
Industry Automation
Which vertical-specific business processes are genuinely automatable today, and which are not?
3 Articles
5 Experiments
Key findings so far
- Document extraction accuracy degrades sharply on non-standard layouts — 94% on clean forms, 61% on free-form documents with the same model
- Customer communication automation works for high-volume, low-stakes interactions but requires mandatory human escalation paths; without them, the 15% error rate causes outsized damage
- Compliance automation in regulated industries requires explainability by default — "the AI decided" is not an acceptable audit trail in fintech or legal contexts
- Workflow orchestration bottlenecks are almost never technical — they are data quality and change management problems that AI amplifies rather than solves
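The mandatory-escalation finding reduces to a routing rule: automation only handles the safe remainder, and everything high-stakes or low-confidence goes to a human. The stakes labels and threshold below are illustrative assumptions:

```python
def route(stakes: str, confidence: float,
          escalation_threshold: float = 0.85) -> str:
    """Mandatory escalation path: anything that is not low-stakes,
    or that the model is unsure about, goes to a human."""
    if stakes != "low":
        return "human"
    if confidence < escalation_threshold:
        return "human"
    return "auto"
```

Note the asymmetry: both conditions route to "human" and only their joint absence routes to "auto". That is what makes the escalation path mandatory rather than best-effort — there is no input for which the automation handles a high-stakes interaction.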
Document processing · Customer communication · Data extraction · Workflow orchestration
Read the research
Latest results
Recent experiment results
Multi-model agent orchestration (routing by task complexity)
40% cost reduction with no measurable quality loss. Now standard in all deployments.
Confirmed
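Routing by task complexity can be sketched as a cheap classifier in front of the model call. The heuristic below (length, structure, keyword cues) and the model names are illustrative assumptions — a production router might use a trained classifier instead:

```python
def pick_model(prompt: str) -> str:
    """Route simple prompts to a small model, hard ones to a large one.
    The scoring rules here are a crude illustrative heuristic."""
    score = 0
    if len(prompt) > 400:          # long prompts tend to be harder
        score += 1
    if prompt.count("\n") > 5:     # multi-part or structured requests
        score += 1
    if any(kw in prompt.lower() for kw in ("refactor", "prove", "debug", "plan")):
        score += 2                 # task verbs that imply multi-step work
    return "large-model" if score >= 2 else "small-model"
```

The cost saving comes from the base rate: if most traffic is simple, most calls hit the small model, and the router itself costs almost nothing.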
Autonomous code review with no human reviewer
Caught 60% of issues; missed every security vulnerability. Human gates are non-negotiable.
Disproven
RAG pipeline with semantic query routing pre-retrieval
Answer relevance improved 38%. Query routing is now a first-class architectural concern.
Confirmed
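Pre-retrieval query routing means picking the right collection before embedding and searching. A minimal sketch, with keyword scoring standing in for the semantic classifier and the collection names invented for illustration:

```python
# Illustrative collections and their cue words; a semantic router would
# replace this table with a classifier or embedding similarity.
ROUTES = {
    "billing": ("invoice", "refund", "charge", "payment"),
    "technical": ("error", "crash", "install", "timeout"),
}

def route_query(query: str, default: str = "general") -> str:
    """Pick a retrieval collection before running retrieval at all,
    so the search space already matches the question's domain."""
    q = query.lower()
    scores = {name: sum(kw in q for kw in kws) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)  # type: ignore[arg-type]
    return best if scores[best] > 0 else default
```

Treating routing as a first-class concern means this step gets its own evaluation and failure handling, rather than being folded into retrieval.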
All experiments, including the failures.
The experiments log documents every hypothesis we tested — what we expected, what happened, and what we changed because of it.
Browse all experiments