Multi-Agent Handoff Protocol v1
Hypothesis
A shared message queue reduces inter-agent latency by >30% compared to direct point-to-point agent communication.
Result
Reduced by 41% in 3-agent chains. Performance degraded significantly in 7+ agent chains — queue saturation at high concurrency introduced new bottlenecks we had not anticipated.
Key Learnings
- Message queue architecture scales well up to 5-6 agents but requires a different topology beyond that — a hub-and-spoke model outperforms linear queues at scale.
- Backpressure handling is the critical engineering problem in multi-agent systems; most early approaches ignore it until they hit production.
- The 41% improvement confirms the hypothesis for the tested range, but the failure at 7+ agents is the more valuable finding — it defines the boundary conditions.
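The backpressure point called out above can be sketched with a bounded shared queue: when the queue is full, producers block instead of flooding the consumer. This is a minimal illustration in Python's asyncio, not the protocol's actual implementation; all names are illustrative.

```python
import asyncio

async def agent_producer(queue: asyncio.Queue, agent_id: str, messages: list[str]) -> None:
    for msg in messages:
        # put() blocks when the queue is full: this is the backpressure point.
        await queue.put((agent_id, msg))

async def agent_consumer(queue: asyncio.Queue, handled: list) -> None:
    while True:
        item = await queue.get()
        if item is None:  # sentinel: shut down
            queue.task_done()
            break
        handled.append(item)
        queue.task_done()

async def run_chain() -> list:
    # Small bound forces backpressure; without maxsize, fast agents saturate the queue.
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    handled: list = []
    producers = [
        agent_producer(queue, f"agent-{i}", [f"step-{j}" for j in range(3)])
        for i in range(3)
    ]
    consumer = asyncio.create_task(agent_consumer(queue, handled))
    await asyncio.gather(*producers)
    await queue.put(None)
    await consumer
    return handled

handled = asyncio.run(run_chain())
```

With an unbounded queue the same producers never block, which is the saturation failure mode seen past 6 agents.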
RAG Chunking Strategy Comparison
Hypothesis
Semantic chunking outperforms fixed-size chunking for technical documentation retrieval accuracy.
Result
Semantic chunking was 23% more accurate on technical Q&A benchmarks versus fixed-size chunking at 512 tokens. It indexed 8% slower. The accuracy gain justifies the indexing overhead for documentation use cases.
Key Learnings
- Fixed-size chunking breaks mid-sentence and mid-concept at exactly the points where retrieval needs coherence — the failure mode is systematic, not random.
- Semantic chunking gains are largest for long-form technical documentation (API references, architecture docs) and smallest for short-form structured content (FAQs, changelogs).
- The 8% indexing overhead is a one-time cost that matters less than retrieval quality in most production scenarios; the tradeoff analysis depends heavily on update frequency.
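The systematic failure mode in the first learning is easy to show in miniature. Real semantic chunking uses embeddings; the sketch below substitutes a crude sentence-boundary chunker as a stand-in, which is enough to show where fixed-size cuts land mid-concept.

```python
import re

def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Cuts every `size` characters, regardless of sentence structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text: str, max_size: int) -> list[str]:
    # Stand-in for semantic chunking: only split at sentence boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("The API accepts JSON requests. Each request needs an auth token. "
       "Tokens expire after one hour. Refresh them via the /token endpoint.")
fixed = fixed_size_chunks(doc, 60)
semantic = sentence_chunks(doc, 60)
```

Every boundary-aware chunk here is a retrievable, self-contained statement; the fixed-size chunks split the auth-token sentence across two chunks, exactly where a retrieval query about tokens needs coherence.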
Spec-to-Code Accuracy Measurement
Hypothesis
Structured specification documents reduce ambiguity and improve first-pass code accuracy compared to free-form brief inputs.
Result
34% reduction in revision cycles when using structured specs versus free-form briefs. First-pass acceptance rate improved from 51% to 79% across a sample of 40 feature implementations.
Key Learnings
- The format of the spec matters as much as its completeness — specs with explicit acceptance criteria drove the largest improvements, not just specs with more detail.
- Agent-generated specs (where the AI asks clarifying questions to build the spec) performed nearly as well as human-written specs, suggesting the spec-generation step itself can be automated.
- The 34% reduction likely understates the full efficiency gain; revision cycles have non-linear cost — later revisions are exponentially more expensive than earlier ones.
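The shape of a structured spec with explicit acceptance criteria can be sketched as a small data structure. The field names below are an assumption, not a standard; the point is that acceptance criteria are explicit and checkable rather than buried in free-form prose.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureSpec:
    title: str
    context: str
    requirements: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A spec without explicit acceptance criteria is effectively
        # a free-form brief, whatever its length.
        return bool(self.requirements) and bool(self.acceptance_criteria)

spec = FeatureSpec(
    title="CSV export for audit logs",
    context="Admins need to hand audit data to compliance.",
    requirements=["Export filtered audit log entries as CSV"],
    acceptance_criteria=[
        "Export of 10k rows completes in under 5 seconds",
        "Column order matches the on-screen table",
    ],
    out_of_scope=["PDF export"],
)
```

A clarifying-question agent can fill this structure field by field, which is the automation path the second learning points at.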
Industry Document Classification Pipeline
Hypothesis
LLMs can replace rule-based classifiers for business document routing with higher accuracy and lower maintenance overhead.
Result
95.3% accuracy on invoice/purchase order/receipt classification, outperforming our rule-based system at 89.1%. The LLM approach required zero maintenance when document formats changed — rule-based systems required manual rule updates for every format variation.
Key Learnings
- The accuracy advantage is real but not the most important finding — the maintenance advantage is. Rule-based classifiers require constant upkeep as document formats drift; the LLM approach adapts without intervention.
- The 4.7% error rate is concentrated in heavily damaged or non-standard documents that humans also struggle with — the system fails in the same places human judgment fails, which is the right failure profile.
- Confidence scoring matters: routing high-confidence classifications automatically and flagging low-confidence items for human review brings effective accuracy above 99% in practice.
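The confidence-gated routing in the last learning is a small amount of logic around the classifier. A sketch, with the classifier call and threshold as placeholders:

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative; tune against human review capacity

def route_document(label: str, confidence: float) -> dict:
    # High-confidence classifications are routed automatically;
    # everything else is flagged for human review.
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"label": label, "queue": label, "needs_review": False}
    return {"label": label, "queue": "human_review", "needs_review": True}

decisions = [
    route_document("invoice", 0.97),
    route_document("receipt", 0.62),
]
```

The effective accuracy above 99% comes from the combination: automated routing only acts on the high-confidence slice, and the damaged or non-standard documents that produce most errors land in the review queue.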
Long-Context Agent Memory Architecture
Hypothesis
Hierarchical memory (working memory / episodic memory / semantic memory) enables coherent reasoning across 100+ step tasks where flat context windows fail.
Current Status
In progress. Early results suggest working memory compression at step transitions is the key engineering challenge. Preliminary data shows 60% fewer context loss errors versus flat context at 50+ steps.
Key Learnings
- Episodic memory (storing summarized task history) is showing more value than expected — agents can reference earlier decisions without re-reading full context.
- Memory retrieval timing is critical — retrieving too early or too late in the reasoning cycle introduces noise. Still characterizing the optimal retrieval window.
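The three tiers from the hypothesis can be sketched as a small class. The compression step here is a stub (keep the first sentence of the evicted entry) standing in for real model-driven summarization, which is the open engineering problem noted above.

```python
from collections import deque

class HierarchicalMemory:
    def __init__(self, working_capacity: int = 5):
        self.working = deque(maxlen=working_capacity)  # recent raw steps
        self.episodic: list[str] = []                  # summarized task history
        self.semantic: dict[str, str] = {}             # stable facts

    def record_step(self, step: str) -> None:
        if len(self.working) == self.working.maxlen:
            # Compress the oldest working-memory entry into episodic memory
            # before the deque evicts it. A real system would summarize
            # with a model; this stub keeps the first sentence.
            oldest = self.working[0]
            self.episodic.append(oldest.split(".")[0])
        self.working.append(step)

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

mem = HierarchicalMemory(working_capacity=3)
for i in range(5):
    mem.record_step(f"step {i}. details for step {i}")
mem.remember_fact("db_schema", "orders table has a customer_id column")
```

Episodic entries let the agent reference earlier decisions without replaying full context, which is the effect showing up in the preliminary data.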
Cross-Model Prompt Portability
Hypothesis
Prompts optimized for one model lose less than 15% effectiveness when ported to another model if they follow a structured format with explicit role, context, constraints, and output format sections.
Current Status
In progress. Testing across 4 model families on 12 task categories. Initial data shows structured prompts retain 83-91% effectiveness on transfer, versus 64-72% for unstructured prompts.
Key Learnings
- Early finding: instruction-following and output-format prompts transfer well; reasoning-intensive prompts show the most degradation across models.
- Model-specific idioms ("think step by step", chain-of-thought triggers) significantly reduce portability; a model-agnostic instruction vocabulary is needed.
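The structured format under test (explicit role, context, constraints, and output-format sections) can be sketched as a simple builder. The section headers are our convention; the hypothesis only claims that keeping these sections explicit transfers better than free-form prose.

```python
def build_prompt(role: str, context: str, constraints: list[str], output_format: str) -> str:
    # Each section is labeled explicitly so the structure survives
    # transfer across model families.
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"ROLE:\n{role}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"CONSTRAINTS:\n{constraint_lines}\n\n"
        f"OUTPUT FORMAT:\n{output_format}"
    )

prompt = build_prompt(
    role="You summarize support tickets.",
    context="Tickets arrive as raw email text.",
    constraints=["Max 3 sentences", "No speculation about root cause"],
    output_format="JSON with keys: summary, priority",
)
```

Note what the builder excludes: no "think step by step" or other model-specific idioms, per the second learning.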
Agent Self-Correction Loop
Hypothesis
An agent reviewing its own output before submission would reduce output errors by 50% versus no self-review.
Result
No statistically significant improvement. Agent self-reviews showed the same blind spots as the original outputs. Abandoned after 3 weeks of structured testing across 200 generation tasks.
Why it failed
The self-review mechanism was flawed by design: the same model that made an error in generation will make the same error in evaluation. Self-review is not an independent check — it is the same process run twice.
Key Learnings
- LLMs exhibit systematic self-evaluation blind spots — errors in reasoning are not detectable by the same reasoning process that produced them. This is a fundamental constraint, not an implementation problem.
- The failure clarifies the role of human review and multi-model validation: the only reliable error-catching mechanism is an independent evaluator with different priors.
- The experiment ruled out an entire category of optimization approaches. That is genuinely useful: we can stop testing self-review variants and redirect research toward cross-model or human-in-the-loop verification.
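The failure mode can be shown in a toy form: a reviewer that shares the generator's process approves the generator's systematic error, while an evaluator with different priors catches it. The functions below are stubs standing in for model calls; in practice the generator and reviewer would be different model families, or a human.

```python
def generate(task: str) -> str:
    # Placeholder generator with a deliberate systematic error marker.
    return f"answer for {task} (off-by-one)"

def self_review(output: str) -> bool:
    # The same process run twice: it shares the generator's blind spot,
    # so it approves the generator's systematic error.
    return True

def independent_review(output: str) -> bool:
    # A reviewer with different priors checks for the error class
    # the generator cannot see in itself.
    return "(off-by-one)" not in output

draft = generate("report totals")
```

The toy makes the structural point only: self-review is not an independent check, whatever the model's capability.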
Zero-Shot Industry Automation
Hypothesis
Zero-shot LLM prompting can replace fine-tuned models for structured data extraction from industry-specific documents, reducing the need for labeled training data.
Result
Accuracy dropped from 95% (fine-tuned model) to 71% (zero-shot) on edge cases. Not production-viable for high-stakes document processing where errors have financial or compliance consequences.
Why it failed
Edge cases in real-world documents require domain-specific pattern recognition that generalizes poorly from zero-shot prompting. The 24% accuracy gap concentrates on exactly the document types where errors matter most.
Key Learnings
- Zero-shot works for common document patterns but degrades on domain-specific variants — the long tail of document formats is where fine-tuned models earn their keep.
- The failure identifies a clear threshold: zero-shot is viable when error cost is low and coverage of common patterns is sufficient; fine-tuning is required when edge-case accuracy is critical.
- Few-shot prompting with carefully selected examples recovers approximately half the accuracy gap (from 71% to ~83%) — a middle ground that may be viable for medium-stakes use cases.
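The few-shot middle ground amounts to prepending a handful of labeled edge-case examples to the extraction prompt. A sketch, with static example selection; in the real pipeline, choosing which examples to include is the part that determines how much of the gap is recovered.

```python
def build_extraction_prompt(document: str, examples: list[tuple[str, dict]]) -> str:
    # Each shot pairs a raw document with its expected extraction,
    # so domain-specific variants are demonstrated rather than described.
    shots = "\n\n".join(
        f"Document:\n{doc}\nExtracted: {fields}" for doc, fields in examples
    )
    return (
        "Extract vendor, date, and total from the document.\n\n"
        f"{shots}\n\n"
        f"Document:\n{document}\nExtracted:"
    )

examples = [
    ("INVOICE #44 Acme GmbH 2024-01-03 TOTAL 120.00",
     {"vendor": "Acme GmbH", "date": "2024-01-03", "total": "120.00"}),
]
prompt = build_extraction_prompt(
    "RECHNUNG Muster AG 05.02.2024 SUMME 88,50", examples
)
```

Document contents and field names here are invented for illustration.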
Automated Architecture Generation
Hypothesis
An LLM can generate production-ready system architectures from high-level requirements, reducing the architect review step to final sign-off rather than active design.
Result
Generated architectures were technically valid but missed domain-specific constraints in 60% of cases. Required full human architect review for every output — no efficiency gain over starting from scratch.
Why it failed
High-level requirements do not encode the organizational, compliance, operational, and political constraints that shape real architecture decisions. The LLM optimized for technical correctness while missing the constraints that actually determine what gets built.
Key Learnings
- System architecture is constrained by factors outside the technical specification — existing infrastructure, team skill sets, regulatory requirements, vendor relationships. These are not capturable in a brief.
- The experiment revealed that LLM architecture generation is most useful as a starting point for discussion, not as an output — it surfaces options and trade-offs that a human architect can then evaluate against real constraints.
- A more promising direction: LLM as architecture reviewer rather than generator — checking proposed architectures against a library of anti-patterns and common failure modes.
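The reviewer direction can be sketched as checks against an anti-pattern library. Here the checks are plain predicates over a toy architecture description; an LLM would replace the matching logic, not the structure. Pattern names and fields are illustrative.

```python
ANTI_PATTERNS = {
    # name -> predicate over an architecture description
    "shared_database": lambda arch: arch.get("services", 0) > 1
    and arch.get("databases", 0) == 1,
    "single_point_of_failure": lambda arch: arch.get("load_balancers", 0) == 0
    and arch.get("services", 0) > 0,
}

def review_architecture(arch: dict) -> list[str]:
    # Return the names of all anti-patterns the proposal matches.
    return [name for name, check in ANTI_PATTERNS.items() if check(arch)]

findings = review_architecture({"services": 3, "databases": 1, "load_balancers": 0})
```

Reviewing against a fixed library sidesteps the failure above: the human architect still supplies the organizational and compliance constraints, and the model only flags known failure modes.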
Cross-Provider LLM Routing
Hypothesis
Dynamic routing between LLM providers based on task type and complexity reduces total inference cost by 40% with less than 5% accuracy degradation.
Result
The routing decision overhead — classification latency, routing logic, fallback handling — exceeded the cost savings in most real-world usage patterns. In high-volume scenarios the economics improved, but the engineering complexity was not justified.
Why it was abandoned
The simpler solution — selecting the appropriate provider at deployment time based on use-case characteristics — achieves most of the cost benefit without runtime routing complexity. We shipped the simpler approach.
Key Learnings
- Dynamic routing adds latency and operational complexity. When the savings are real but achievable through simpler means, the simpler means wins — this is a recurring pattern in applied AI infrastructure.
- Task classification for routing purposes is itself a non-trivial AI problem, creating a recursive dependency (you need an LLM to decide which LLM to use).
- The experiment was valuable because it validated the cost savings hypothesis in principle while revealing that the deployment-time selection approach captures most of the benefit — we abandoned the complex approach and shipped the simple one.
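The shipped approach reduces to a static map resolved once per deployment, with no per-request classification or fallback logic. Provider names and the table contents below are illustrative.

```python
PROVIDER_TABLE = {
    # (task_type, complexity) -> provider, chosen at deployment time
    ("extraction", "low"): "small-fast-model",
    ("extraction", "high"): "large-accurate-model",
    ("summarization", "low"): "small-fast-model",
    ("summarization", "high"): "mid-tier-model",
}

def select_provider(task_type: str, complexity: str,
                    default: str = "mid-tier-model") -> str:
    # Resolved from the use case's known characteristics; no runtime
    # routing, no classification latency, no recursive LLM dependency.
    return PROVIDER_TABLE.get((task_type, complexity), default)

provider = select_provider("extraction", "low")
```

Everything dynamic routing needed an LLM to decide per request is encoded here as a lookup decided once, which is why it captures most of the cost benefit at a fraction of the complexity.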