Multi-Agent Handoff Protocol v1
Hypothesis
A shared message queue reduces inter-agent latency by >30% compared to direct point-to-point agent communication.
Result
Reduced by 41% in 3-agent chains. Performance degraded significantly in 7+ agent chains — queue saturation at high concurrency introduced new bottlenecks we had not anticipated.
Key Learnings
- Message queue architecture scales well up to 5-6 agents but requires a different topology beyond that — a hub-and-spoke model outperforms linear queues at scale.
- Backpressure handling is the critical engineering problem in multi-agent systems; most early approaches ignore it until they hit production.
- The 41% improvement confirms the hypothesis for the tested range, but the failure at 7+ agents is the more valuable finding — it defines the boundary conditions.
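The backpressure point called out above can be sketched with a bounded shared queue: when the queue is full, producers block instead of flooding the consumer. This is a minimal illustration in Python's asyncio, not the protocol's actual implementation; all names are illustrative.

```python
import asyncio

async def agent_producer(queue: asyncio.Queue, agent_id: str, messages: list[str]) -> None:
    for msg in messages:
        # put() blocks when the queue is full: this is the backpressure point.
        await queue.put((agent_id, msg))

async def agent_consumer(queue: asyncio.Queue, handled: list) -> None:
    while True:
        item = await queue.get()
        if item is None:  # sentinel: shut down
            queue.task_done()
            break
        handled.append(item)
        queue.task_done()

async def run_chain() -> list:
    # Small bound forces backpressure; without maxsize, fast agents saturate the queue.
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    handled: list = []
    producers = [
        agent_producer(queue, f"agent-{i}", [f"step-{j}" for j in range(3)])
        for i in range(3)
    ]
    consumer = asyncio.create_task(agent_consumer(queue, handled))
    await asyncio.gather(*producers)
    await queue.put(None)
    await consumer
    return handled

handled = asyncio.run(run_chain())
```

With an unbounded queue the same producers never block, which is the saturation failure mode seen past 6 agents.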
RAG Chunking Strategy Comparison
Hypothesis
Semantic chunking outperforms fixed-size chunking for technical documentation retrieval accuracy.
Result
Semantic chunking was 23% more accurate on technical Q&A benchmarks versus fixed-size chunking at 512 tokens. It indexed 8% slower. The accuracy gain justifies the indexing overhead for documentation use cases.
Key Learnings
- Fixed-size chunking breaks mid-sentence and mid-concept at exactly the points where retrieval needs coherence — the failure mode is systematic, not random.
- Semantic chunking gains are largest for long-form technical documentation (API references, architecture docs) and smallest for short-form structured content (FAQs, changelogs).
- The 8% indexing overhead is a one-time cost that matters less than retrieval quality in most production scenarios; the tradeoff analysis depends heavily on update frequency.
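The systematic failure mode in the first learning is easy to show in miniature. Real semantic chunking uses embeddings; the sketch below substitutes a crude sentence-boundary chunker as a stand-in, which is enough to show where fixed-size cuts land mid-concept.

```python
import re

def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Cuts every `size` characters, regardless of sentence structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text: str, max_size: int) -> list[str]:
    # Stand-in for semantic chunking: only split at sentence boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("The API accepts JSON requests. Each request needs an auth token. "
       "Tokens expire after one hour. Refresh them via the /token endpoint.")
fixed = fixed_size_chunks(doc, 60)
semantic = sentence_chunks(doc, 60)
```

Every boundary-aware chunk here is a retrievable, self-contained statement; the fixed-size chunks split the auth-token sentence across two chunks, exactly where a retrieval query about tokens needs coherence.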
Spec-to-Code Accuracy Measurement
Hypothesis
Structured specification documents reduce ambiguity and improve first-pass code accuracy compared to free-form brief inputs.
Result
34% reduction in revision cycles when using structured specs versus free-form briefs. First-pass acceptance rate improved from 51% to 79% across a sample of 40 feature implementations.
Key Learnings
- The format of the spec matters as much as its completeness — specs with explicit acceptance criteria drove the largest improvements, not just specs with more detail.
- Agent-generated specs (where the AI asks clarifying questions to build the spec) performed nearly as well as human-written specs, suggesting the spec-generation step itself can be automated.
- The 34% reduction likely understates the full efficiency gain; revision cycles have non-linear cost — later revisions are exponentially more expensive than earlier ones.
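The shape of a structured spec with explicit acceptance criteria can be sketched as a small data structure. The field names below are an assumption, not a standard; the point is that acceptance criteria are explicit and checkable rather than buried in free-form prose.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureSpec:
    title: str
    context: str
    requirements: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A spec without explicit acceptance criteria is effectively
        # a free-form brief, whatever its length.
        return bool(self.requirements) and bool(self.acceptance_criteria)

spec = FeatureSpec(
    title="CSV export for audit logs",
    context="Admins need to hand audit data to compliance.",
    requirements=["Export filtered audit log entries as CSV"],
    acceptance_criteria=[
        "Export of 10k rows completes in under 5 seconds",
        "Column order matches the on-screen table",
    ],
    out_of_scope=["PDF export"],
)
```

A clarifying-question agent can fill this structure field by field, which is the automation path the second learning points at.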
Industry Document Classification Pipeline
Hypothesis
LLMs can replace rule-based classifiers for business document routing with higher accuracy and lower maintenance overhead.
Result
95.3% accuracy on invoice/purchase order/receipt classification, outperforming our rule-based system at 89.1%. The LLM approach required zero maintenance when document formats changed — rule-based systems required manual rule updates for every format variation.
Key Learnings
- The accuracy advantage is real but not the most important finding — the maintenance advantage is. Rule-based classifiers require constant upkeep as document formats drift; the LLM approach adapts without intervention.
- The 4.7% error rate is concentrated in heavily damaged or non-standard documents that humans also struggle with — the system fails in the same places human judgment fails, which is the right failure profile.
- Confidence scoring matters: routing high-confidence classifications automatically and flagging low-confidence items for human review brings effective accuracy above 99% in practice.
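The confidence-gated routing in the last learning is a small amount of logic around the classifier. A sketch, with the classifier call and threshold as placeholders:

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative; tune against human review capacity

def route_document(label: str, confidence: float) -> dict:
    # High-confidence classifications are routed automatically;
    # everything else is flagged for human review.
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"label": label, "queue": label, "needs_review": False}
    return {"label": label, "queue": "human_review", "needs_review": True}

decisions = [
    route_document("invoice", 0.97),
    route_document("receipt", 0.62),
]
```

The effective accuracy above 99% comes from the combination: automated routing only acts on the high-confidence slice, and the damaged or non-standard documents that produce most errors land in the review queue.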
Long-Context Agent Memory Architecture
Hypothesis
Hierarchical memory (working memory / episodic memory / semantic memory) enables coherent reasoning across 100+ step tasks where flat context windows fail.
Current Status
In progress. Early results suggest working memory compression at step transitions is the key engineering challenge. Preliminary data shows 60% fewer context loss errors versus flat context at 50+ steps.
Key Learnings
- Episodic memory (storing summarized task history) is showing more value than expected — agents can reference earlier decisions without re-reading full context.
- Memory retrieval timing is critical — retrieving too early or too late in the reasoning cycle introduces noise. Still characterizing the optimal retrieval window.
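The three tiers from the hypothesis can be sketched as a small class. The compression step here is a stub (keep the first sentence of the evicted entry) standing in for real model-driven summarization, which is the open engineering problem noted above.

```python
from collections import deque

class HierarchicalMemory:
    def __init__(self, working_capacity: int = 5):
        self.working = deque(maxlen=working_capacity)  # recent raw steps
        self.episodic: list[str] = []                  # summarized task history
        self.semantic: dict[str, str] = {}             # stable facts

    def record_step(self, step: str) -> None:
        if len(self.working) == self.working.maxlen:
            # Compress the oldest working-memory entry into episodic memory
            # before the deque evicts it. A real system would summarize
            # with a model; this stub keeps the first sentence.
            oldest = self.working[0]
            self.episodic.append(oldest.split(".")[0])
        self.working.append(step)

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

mem = HierarchicalMemory(working_capacity=3)
for i in range(5):
    mem.record_step(f"step {i}. details for step {i}")
mem.remember_fact("db_schema", "orders table has a customer_id column")
```

Episodic entries let the agent reference earlier decisions without replaying full context, which is the effect showing up in the preliminary data.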
Cross-Model Prompt Portability
Hypothesis
Prompts optimized for one model lose less than 15% effectiveness when ported to another model if they follow a structured format with explicit role, context, constraints, and output format sections.
Current Status
In progress. Testing across 4 model families on 12 task categories. Initial data shows structured prompts retain 83-91% effectiveness on transfer, versus 64-72% for unstructured prompts.
Key Learnings
- Early finding: instruction-following and output-format prompts transfer well; reasoning-intensive prompts show the most degradation across models.
- Model-specific idioms ("think step by step", chain-of-thought triggers) significantly reduce portability; a model-agnostic instruction vocabulary is needed.
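The structured format under test (explicit role, context, constraints, and output-format sections) can be sketched as a simple builder. The section headers are our convention; the hypothesis only claims that keeping these sections explicit transfers better than free-form prose.

```python
def build_prompt(role: str, context: str, constraints: list[str], output_format: str) -> str:
    # Each section is labeled explicitly so the structure survives
    # transfer across model families.
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"ROLE:\n{role}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"CONSTRAINTS:\n{constraint_lines}\n\n"
        f"OUTPUT FORMAT:\n{output_format}"
    )

prompt = build_prompt(
    role="You summarize support tickets.",
    context="Tickets arrive as raw email text.",
    constraints=["Max 3 sentences", "No speculation about root cause"],
    output_format="JSON with keys: summary, priority",
)
```

Note what the builder excludes: no "think step by step" or other model-specific idioms, per the second learning.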
Agent Self-Correction Loop
Hypothesis
An agent reviewing its own output before submission would reduce output errors by 50% versus no self-review.
Result
No statistically significant improvement. Agent self-reviews showed the same blind spots as the original outputs. Abandoned after 3 weeks of structured testing across 200 generation tasks.
Why it failed
The self-review mechanism was flawed by design: the same model that made an error in generation will make the same error in evaluation. Self-review is not an independent check — it is the same process run twice.
Key Learnings
- LLMs exhibit systematic self-evaluation blind spots — errors in reasoning are not detectable by the same reasoning process that produced them. This is a fundamental constraint, not an implementation problem.
- The failure clarifies the role of human review and multi-model validation: the only reliable error-catching mechanism is an independent evaluator with different priors.
- The experiment ruled out an entire category of optimization approaches. That is genuinely useful: we can stop testing self-review variants and redirect research toward cross-model or human-in-the-loop verification.
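The failure mode can be shown in a toy form: a reviewer that shares the generator's process approves the generator's systematic error, while an evaluator with different priors catches it. The functions below are stubs standing in for model calls; in practice the generator and reviewer would be different model families, or a human.

```python
def generate(task: str) -> str:
    # Placeholder generator with a deliberate systematic error marker.
    return f"answer for {task} (off-by-one)"

def self_review(output: str) -> bool:
    # The same process run twice: it shares the generator's blind spot,
    # so it approves the generator's systematic error.
    return True

def independent_review(output: str) -> bool:
    # A reviewer with different priors checks for the error class
    # the generator cannot see in itself.
    return "(off-by-one)" not in output

draft = generate("report totals")
```

The toy makes the structural point only: self-review is not an independent check, whatever the model's capability.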
Zero-Shot Industry Automation
Hypothesis
Zero-shot LLM prompting can replace fine-tuned models for structured data extraction from industry-specific documents, reducing the need for labeled training data.
Result
Accuracy dropped from 95% (fine-tuned model) to 71% (zero-shot) on edge cases. Not production-viable for high-stakes document processing where errors have financial or compliance consequences.
Why it failed
Edge cases in real-world documents require domain-specific pattern recognition that generalizes poorly from zero-shot prompting. The 24% accuracy gap concentrates on exactly the document types where errors matter most.
Key Learnings
- Zero-shot works for common document patterns but degrades on domain-specific variants — the long tail of document formats is where fine-tuned models earn their keep.
- The failure identifies a clear threshold: zero-shot is viable when error cost is low and coverage of common patterns is sufficient; fine-tuning is required when edge-case accuracy is critical.
- Few-shot prompting with carefully selected examples recovers approximately half the accuracy gap (from 71% to ~83%) — a middle ground that may be viable for medium-stakes use cases.
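The few-shot middle ground amounts to prepending a handful of labeled edge-case examples to the extraction prompt. A sketch, with static example selection; in the real pipeline, choosing which examples to include is the part that determines how much of the gap is recovered.

```python
def build_extraction_prompt(document: str, examples: list[tuple[str, dict]]) -> str:
    # Each shot pairs a raw document with its expected extraction,
    # so domain-specific variants are demonstrated rather than described.
    shots = "\n\n".join(
        f"Document:\n{doc}\nExtracted: {fields}" for doc, fields in examples
    )
    return (
        "Extract vendor, date, and total from the document.\n\n"
        f"{shots}\n\n"
        f"Document:\n{document}\nExtracted:"
    )

examples = [
    ("INVOICE #44 Acme GmbH 2024-01-03 TOTAL 120.00",
     {"vendor": "Acme GmbH", "date": "2024-01-03", "total": "120.00"}),
]
prompt = build_extraction_prompt(
    "RECHNUNG Muster AG 05.02.2024 SUMME 88,50", examples
)
```

Document contents and field names here are invented for illustration.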
Automated Architecture Generation
Hypothesis
An LLM can generate production-ready system architectures from high-level requirements, reducing the architect review step to final sign-off rather than active design.
Result
Generated architectures were technically valid but missed domain-specific constraints in 60% of cases. Required full human architect review for every output — no efficiency gain over starting from scratch.
Why it failed
High-level requirements do not encode the organizational, compliance, operational, and political constraints that shape real architecture decisions. The LLM optimized for technical correctness while missing the constraints that actually determine what gets built.
Key Learnings
- System architecture is constrained by factors outside the technical specification — existing infrastructure, team skill sets, regulatory requirements, vendor relationships. These are not capturable in a brief.
- The experiment revealed that LLM architecture generation is most useful as a starting point for discussion, not as an output — it surfaces options and trade-offs that a human architect can then evaluate against real constraints.
- A more promising direction: LLM as architecture reviewer rather than generator — checking proposed architectures against a library of anti-patterns and common failure modes.
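The reviewer direction can be sketched as checks against an anti-pattern library. Here the checks are plain predicates over a toy architecture description; an LLM would replace the matching logic, not the structure. Pattern names and fields are illustrative.

```python
ANTI_PATTERNS = {
    # name -> predicate over an architecture description
    "shared_database": lambda arch: arch.get("services", 0) > 1
    and arch.get("databases", 0) == 1,
    "single_point_of_failure": lambda arch: arch.get("load_balancers", 0) == 0
    and arch.get("services", 0) > 0,
}

def review_architecture(arch: dict) -> list[str]:
    # Return the names of all anti-patterns the proposal matches.
    return [name for name, check in ANTI_PATTERNS.items() if check(arch)]

findings = review_architecture({"services": 3, "databases": 1, "load_balancers": 0})
```

Reviewing against a fixed library sidesteps the failure above: the human architect still supplies the organizational and compliance constraints, and the model only flags known failure modes.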
Cross-Provider LLM Routing
Hypothesis
Dynamic routing between LLM providers based on task type and complexity reduces total inference cost by 40% with less than 5% accuracy degradation.
Result
The routing decision overhead — classification latency, routing logic, fallback handling — exceeded the cost savings in most real-world usage patterns. In high-volume scenarios the economics improved, but the engineering complexity was not justified.
Why it was abandoned
The simpler solution — selecting the appropriate provider at deployment time based on use-case characteristics — achieves most of the cost benefit without runtime routing complexity. We shipped the simpler approach.
Key Learnings
- Dynamic routing adds latency and operational complexity. When the savings are real but achievable through simpler means, the simpler means wins — this is a recurring pattern in applied AI infrastructure.
- Task classification for routing purposes is itself a non-trivial AI problem, creating a recursive dependency (you need an LLM to decide which LLM to use).
- The experiment was valuable because it validated the cost savings hypothesis in principle while revealing that the deployment-time selection approach captures most of the benefit — we abandoned the complex approach and shipped the simple one.
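The shipped approach reduces to a static map resolved once per deployment, with no per-request classification or fallback logic. Provider names and the table contents below are illustrative.

```python
PROVIDER_TABLE = {
    # (task_type, complexity) -> provider, chosen at deployment time
    ("extraction", "low"): "small-fast-model",
    ("extraction", "high"): "large-accurate-model",
    ("summarization", "low"): "small-fast-model",
    ("summarization", "high"): "mid-tier-model",
}

def select_provider(task_type: str, complexity: str,
                    default: str = "mid-tier-model") -> str:
    # Resolved from the use case's known characteristics; no runtime
    # routing, no classification latency, no recursive LLM dependency.
    return PROVIDER_TABLE.get((task_type, complexity), default)

provider = select_provider("extraction", "low")
```

Everything dynamic routing needed an LLM to decide per request is encoded here as a lookup decided once, which is why it captures most of the cost benefit at a fraction of the complexity.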