SocioFi Technology

AI-Native Development: Human Verified

Applied AI

ACTIVE · 4 articles published · 7 experiments logged

What does AI actually do well — and where does it fail — in real production software?

The gap between AI capability in research papers and AI capability inside real production software is larger than most practitioners expect. Models that score impressively on academic benchmarks perform very differently on real documents, real queries, and real user behaviour.

This stream investigates practical AI integration — not what models can theoretically do, but what they reliably do when connected to real data, exposed to real inputs, and asked to operate without a human in the loop to catch mistakes. We focus on document intelligence, structured output extraction, retrieval-augmented generation, and autonomous workflow execution.

The most valuable findings in this stream are the negative ones: places where AI fails in ways that look like success. A model that returns plausible-looking JSON that silently violates the schema. A RAG system that retrieves confidently but retrieves the wrong document. A prompt that worked last month but produces subtly worse output today, after a model update nobody announced.

We publish these failure modes because they are predictable, reproducible, and fixable — but only if you know to look for them. The goal is to save practitioners from discovering them the hard way in production.

What we know so far

Key findings

Evidence

We ran the same question set against three different models connected to the same retrieval system, then ran the same models against retrieval systems of varying quality. Retrieval quality explained 3x more variance in answer accuracy than model capability. Investing in retrieval quality first is the correct order of priorities for most RAG deployments.

Methodology

We tested 3 models × 4 retrieval configurations on a 200-question evaluation set with ground-truth answers from a real client knowledge base, measuring exact match, semantic similarity, and factual grounding rate.
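
For a sense of what this looks like in practice, a minimal version of such a harness might be structured as below. The model identifiers, retrieval configuration names, and the answer() stub are placeholders, and the string-ratio similarity is a cheap stand-in for embedding-based semantic similarity.

```python
# Minimal sketch of a RAG evaluation harness: models x retrieval configs
# over a fixed question set with ground-truth answers. answer() is a stub,
# and the similarity proxy stands in for embedding cosine similarity.
from difflib import SequenceMatcher

MODELS = ["model-a", "model-b", "model-c"]                   # hypothetical identifiers
RETRIEVAL_CONFIGS = ["bm25", "dense", "hybrid", "reranked"]  # hypothetical configs

def answer(model: str, config: str, question: str) -> str:
    """Stand-in for the real pipeline: retrieve with `config`, answer with `model`."""
    return "stub answer"

def semantic_similarity(a: str, b: str) -> float:
    # Cheap proxy; a real harness would use embedding-based similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def evaluate(questions: list[dict]) -> dict:
    """questions: [{"q": ..., "gold": ...}, ...] -> per-(model, config) scores."""
    scores = {}
    for model in MODELS:
        for config in RETRIEVAL_CONFIGS:
            exact, semantic = 0, 0.0
            for item in questions:
                pred = answer(model, config, item["q"])
                exact += int(pred.strip().lower() == item["gold"].strip().lower())
                semantic += semantic_similarity(pred, item["gold"])
            n = max(len(questions), 1)
            scores[(model, config)] = {
                "exact_match": exact / n,
                "semantic_similarity": semantic / n,
            }
    return scores
```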

Evidence

Across 12 production structured extraction pipelines, an average of 11% of model outputs contained schema violations that were never surfaced as errors — the outputs came back as syntactically valid JSON that violated semantic constraints (wrong enum values, plausible-but-incorrect field types, missing conditionally required fields). These failures were invisible without explicit validation.

Methodology

We seeded 40 known schema violation patterns into test documents across varying complexity levels and measured what percentage of model outputs violated the schema without triggering a parsing error.
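
The fix for this class of failure is explicit validation of every response against a typed schema, including the conditional constraints. A minimal sketch using Pydantic, with an invented invoice schema rather than any client schema, looks like this:

```python
# Sketch of the validation layer that catches "valid JSON, invalid meaning".
# The Invoice schema here is an invented example, not a client schema.
import json
from typing import Literal, Optional
from pydantic import BaseModel, ValidationError, model_validator

class Invoice(BaseModel):
    currency: Literal["EUR", "USD", "GBP"]      # wrong enum values fail here
    total: float                                # plausible-but-wrong types fail here
    status: Literal["draft", "issued", "paid"]
    paid_date: Optional[str] = None

    @model_validator(mode="after")
    def paid_requires_date(self) -> "Invoice":
        # Conditionally required field: easy for a model to omit silently.
        if self.status == "paid" and self.paid_date is None:
            raise ValueError("paid invoices must include paid_date")
        return self

def parse_model_output(raw: str) -> Optional[Invoice]:
    """Return a validated Invoice, or None so the failure is counted, not hidden."""
    try:
        return Invoice(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None

# json.loads accepts this output, but validation rejects it (paid without a date).
print(parse_model_output('{"currency": "EUR", "total": 120.0, "status": "paid"}'))
```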

Evidence

We maintain a regression suite of 50 stable prompts with expected outputs. Running this suite after 6 model updates over 8 months detected measurable output drift in 4 of 6 updates — in 2 cases, the drift was large enough to break downstream logic in production systems. Model updates are not announced for self-hosted models and are inconsistently documented for hosted APIs.

Methodology

Automated regression suite running against a prompt bank with expected outputs, scored by exact match plus semantic similarity. Threshold for "drift detected" is a 5% or greater change in either metric.
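
The suite itself does not need to be elaborate. The sketch below assumes a run_prompt() stand-in for the real model call and a string-ratio proxy for semantic similarity; the 5% threshold is the one described above.

```python
# Minimal sketch of a prompt regression check with a 5% drift threshold.
# run_prompt() is a stand-in for the real model call; similarity is a cheap proxy.
from difflib import SequenceMatcher

DRIFT_THRESHOLD = 0.05  # 5% change in either metric counts as drift

def run_prompt(prompt: str) -> str:
    return "stub output"  # replace with the actual model call

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def check_drift(prompt_bank: list[dict], baseline: dict) -> list[str]:
    """prompt_bank: [{"id", "prompt", "expected"}]; baseline: id -> prior scores."""
    drifted = []
    for case in prompt_bank:
        out = run_prompt(case["prompt"])
        current = {
            "exact": float(out.strip() == case["expected"].strip()),
            "semantic": similarity(out, case["expected"]),
        }
        prior = baseline[case["id"]]
        if any(abs(current[m] - prior[m]) >= DRIFT_THRESHOLD for m in current):
            drifted.append(case["id"])
    return drifted
```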

Evidence

Across 5 production systems using self-consistency checking (running the same query multiple times and measuring agreement), confident hallucinations — cases where the model provided a definitive wrong answer — dropped by 83% on average. The cost increase of 40–60% is justified for high-stakes outputs but not for routine information retrieval.

Methodology

We ran A/B tests on production query traffic, splitting requests between standard single-pass inference and self-consistency checking (5 independent samples). Confident hallucinations were identified by cross-referencing outputs against verified source documents.
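
A minimal sketch of the mechanism, assuming a placeholder sample_model() call and a simple normalised majority vote as the agreement rule (one possible choice, not necessarily the production implementation):

```python
# Sketch of self-consistency checking: 5 independent samples, majority vote,
# and an agreement score used to decide whether to trust the answer.
from collections import Counter

N_SAMPLES = 5
AGREEMENT_THRESHOLD = 0.8  # e.g. 4 of 5 samples must agree

def sample_model(query: str) -> str:
    return "stub answer"  # stand-in for one non-deterministic model call

def normalise(answer: str) -> str:
    return " ".join(answer.lower().split())

def self_consistent_answer(query: str) -> tuple[str, float, bool]:
    """Return (majority answer, agreement rate, whether it clears the threshold)."""
    samples = [normalise(sample_model(query)) for _ in range(N_SAMPLES)]
    top_answer, count = Counter(samples).most_common(1)[0]
    agreement = count / N_SAMPLES
    return top_answer, agreement, agreement >= AGREEMENT_THRESHOLD
```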

How we work

Methodology

We test against real document corpora from Studio projects (anonymised) and purpose-built adversarial datasets. No synthetic benchmarks — we test only conditions that reflect how these systems behave in production.

Experiment types
  • Document extraction accuracy testing across clean, semi-structured, and free-form layouts
  • Schema validation stress testing: seeding deliberate violations and measuring silent failure rates
  • RAG retrieval quality analysis: measuring precision and recall against gold-standard answer sets
  • Prompt regression testing: running stable prompts after model updates and measuring output drift
  • Hallucination rate measurement using self-consistency checking and fact verification against source documents

How we measure

Primary metrics are accuracy (against ground truth), false confidence rate (wrong answer presented with high confidence), and silent failure rate (technically valid output that violates intended semantics). We track these over time to detect regression after model updates.

Transparency commitment

We document model versions, prompt versions, and dataset characteristics for every experiment. When we discover that a finding changes after a model update, we publish an update to the original finding.

Radical transparency

Experiment log — including the failures

Semantic query routing pre-retrieval in RAG pipeline
Confirmed

Hypothesis: Routing queries to the most semantically relevant data source before retrieval will improve answer accuracy more than improving retrieval within a single source.

Result

Answer relevance improved by 38% on a 200-question evaluation set, with no increase in latency.

What we learned

Query routing is now a first-class step in every RAG pipeline we build. Routing based on query intent (procedural vs. factual vs. comparative) outperformed routing based on semantic similarity to source descriptions alone.
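
A stripped-down sketch of intent-based routing is shown below. The intent labels match those above, but the keyword cues and index names are invented for illustration; in practice the intent classification is a model call rather than keyword matching.

```python
# Illustrative sketch of intent-based pre-retrieval routing. The keyword rules
# and source names are invented; production routing would classify intent with
# a model rather than keyword matching.
PROCEDURAL_CUES = ("how do i", "how to", "steps to", "set up")
COMPARATIVE_CUES = ("vs", "versus", "compare", "difference between")

INTENT_TO_SOURCE = {
    "procedural": "runbooks_index",
    "comparative": "product_matrix_index",
    "factual": "knowledge_base_index",
}

def classify_intent(query: str) -> str:
    q = query.lower()
    if any(cue in q for cue in PROCEDURAL_CUES):
        return "procedural"
    if any(cue in q for cue in COMPARATIVE_CUES):
        return "comparative"
    return "factual"

def route(query: str) -> str:
    """Pick the retrieval index before retrieval runs."""
    return INTENT_TO_SOURCE[classify_intent(query)]
```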

January 2026
Zero-shot document classification on free-form layouts
Partial result

Hypothesis: A well-prompted model can classify document types (invoice, contract, report, letter) with >90% accuracy without fine-tuning on domain-specific examples.

Result

94% accuracy on structured document types; 61% on free-form layouts without consistent visual structure.

What we learned

Zero-shot classification is reliable for templated documents and unreliable for free-form ones. We now use classification confidence to route low-confidence documents to a human-in-the-loop step before downstream processing.
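
In sketch form, the routing step is a single confidence gate before downstream processing; the threshold and queue names below are illustrative, not the production values.

```python
# Sketch of confidence-gated routing: low-confidence classifications go to
# human review instead of downstream automation. Threshold is illustrative.
REVIEW_THRESHOLD = 0.85

def route_document(doc_id: str, label: str, confidence: float) -> str:
    """Return the queue a classified document should be sent to."""
    if confidence < REVIEW_THRESHOLD:
        return "human_review"        # free-form layouts tend to land here
    return f"auto_pipeline:{label}"  # e.g. auto_pipeline:invoice
```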

November 2025
Full automation of recurring report generation
Partial result

Hypothesis: An AI pipeline can generate weekly operational reports from structured data sources with quality indistinguishable from human-written reports.

Result

Quantitative sections were rated equivalent to human-written reports. Narrative interpretation sections were rated significantly lower on coherence and appropriate emphasis.

What we learned

AI is reliable for data-to-text generation when the interpretation rules are explicit. Qualitative analysis that depends on judgment about what matters still requires human involvement. We now use AI for the data synthesis layer and human editors for the interpretation layer.

October 2025
Active investigations

What we are working on next

Continuous hallucination monitoring on live traffic

Building a lightweight hallucination detector that runs on sampled production traffic and alerts when the rate crosses a threshold — without requiring ground truth labels for every response.
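
One way the idea could work, purely as a conceptual sketch rather than the detector we are building, is to reuse self-consistency agreement as a label-free proxy signal on a small sample of traffic:

```python
# Conceptual sketch only: label-free hallucination monitoring by sampling a
# fraction of live queries and reusing self-consistency agreement as a proxy
# signal. Sampling rate, window size, and alert threshold are invented values.
import random
from collections import deque

SAMPLE_RATE = 0.02          # check roughly 2% of traffic
WINDOW = deque(maxlen=500)  # rolling window of proxy scores
ALERT_THRESHOLD = 0.15      # alert if >15% of sampled queries look inconsistent

def maybe_monitor(query: str, agreement_fn) -> None:
    """agreement_fn(query) -> float in [0, 1], e.g. self-consistency agreement."""
    if random.random() > SAMPLE_RATE:
        return
    agreement = agreement_fn(query)
    WINDOW.append(1.0 if agreement < 0.8 else 0.0)  # 1 = suspected hallucination
    if len(WINDOW) == WINDOW.maxlen:
        rate = sum(WINDOW) / len(WINDOW)
        if rate > ALERT_THRESHOLD:
            print(f"ALERT: suspected hallucination rate {rate:.1%}")
```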

Hybrid retrieval: dense + sparse indexing on real-world document corpora

Testing whether combining vector similarity retrieval with keyword (BM25) retrieval consistently improves recall on the long-tail queries that pure dense retrieval misses.
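
One common fusion method, and an assumption for this sketch rather than a settled choice for the investigation, is reciprocal rank fusion over the two ranked lists:

```python
# Sketch of hybrid retrieval fusion: merge a dense (vector) ranking and a
# sparse (BM25) ranking with reciprocal rank fusion. The k constant and the
# ranked-list inputs are illustrative; retrieval itself is out of scope here.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: each a list of doc IDs ordered best-first from one retriever."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: combine dense and BM25 result lists for one query.
dense_hits = ["doc_12", "doc_4", "doc_31"]
bm25_hits = ["doc_4", "doc_87", "doc_12"]
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))
```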
