We publish our benchmark results publicly because we think the industry needs more honesty about what AI development tools can and cannot do. These are real numbers from real workloads, updated quarterly.
Each benchmark measures a specific capability against labelled data from real Studio projects. Methodology notes follow each card.
Benchmarks are only as useful as their methodology is honest. Here is exactly how these numbers are collected and what "updated quarterly" means in practice.
All benchmarks use real tasks from Studio client projects, anonymised and with client permission. Synthetic test cases are used only to supplement, never as the primary measurement.
Where possible, evaluation is done by an engineer who does not know whether the output was AI-generated or human-written. This removes confirmation bias from quality assessments.
Numbers are recalculated every quarter with fresh data. Published benchmarks include the measurement date. We do not retroactively revise historical figures.
Every failure is classified by type: model error, prompt-design error, tool failure, or evaluation error. This lets us attribute improvements to their actual causes.
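As a rough sketch of what per-category attribution looks like in practice, the snippet below tags each failed case with a category and tallies the counts. The category names and case IDs are illustrative placeholders, not our internal taxonomy.

```python
from collections import Counter

# Hypothetical failure categories (illustrative, not an actual internal taxonomy)
CATEGORIES = {"model_error", "prompt_design", "tool_failure", "evaluation_error"}

def tally_failures(failures):
    """Count failed cases by category, rejecting unknown labels."""
    counts = Counter()
    for case_id, category in failures:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category for {case_id}: {category}")
        counts[category] += 1
    return counts

failures = [
    ("case-014", "model_error"),
    ("case-022", "prompt_design"),
    ("case-031", "model_error"),
]
print(tally_failures(failures))  # model_error appears twice, prompt_design once
```

Keeping the category set closed (and failing loudly on unknown labels) is what makes quarter-over-quarter comparisons of the tallies meaningful.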
We believe documenting limitations is as important as documenting results. Read these before drawing conclusions from the numbers above.
Benchmark results vary significantly by codebase maturity, domain, and task complexity. Do not extrapolate these numbers to your specific project without understanding the measurement conditions.
High test coverage does not mean the tests assert the right things. Our 94% coverage figure measures line coverage, not the semantic correctness of the test assertions.
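A toy example of the gap between line coverage and assertion quality: both tests below execute every line of `apply_discount`, so each yields 100% line coverage, but only the second would catch a broken discount calculation. The function and names are invented for illustration.

```python
def apply_discount(price, rate):
    """Apply a fractional discount (e.g. 0.2 for 20% off) to a price."""
    discounted = price * (1 - rate)
    return round(discounted, 2)

def test_weak():
    # Executes every line of apply_discount -> 100% line coverage...
    result = apply_discount(100.0, 0.2)
    assert result >= 0  # ...but this also passes if the result is 100.0, 20.0, or 0.0

def test_meaningful():
    # Same coverage, but the assertion pins the actual expected value
    assert apply_discount(100.0, 0.2) == 80.0
```

This is why a coverage percentage alone says little: the weak and meaningful tests are indistinguishable to a coverage tool.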
Every metric has a strong complexity dependency. Simple tasks score significantly higher. Our reported numbers are averages across complexity bands — the distribution is wide.
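To make the averaging concrete, here is a sketch of how a headline average can hide a wide per-band spread. The band names, scores, and task counts are invented for illustration, not our measured data.

```python
# Hypothetical pass rates per complexity band; n weights the headline average
bands = {
    "simple":  {"score": 0.95, "n": 120},
    "medium":  {"score": 0.80, "n": 90},
    "complex": {"score": 0.55, "n": 40},
}

total_tasks = sum(b["n"] for b in bands.values())
headline = sum(b["score"] * b["n"] for b in bands.values()) / total_tasks

print(f"headline average: {headline:.2f}")   # ~0.83, dominated by simple tasks
for name, b in bands.items():
    print(f"  {name:>7}: {b['score']:.2f} (n={b['n']})")
```

A project made up mostly of "complex"-band tasks should expect results closer to that band's score than to the headline figure, which is the point of the caveat above.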
AI model capabilities improve (and occasionally regress) with new versions. Our benchmarks reflect the model versions in use at measurement time, noted in each benchmark record.
A direct comparison across time, cost, and quality dimensions. We include dimensions where traditional development wins — because pretending otherwise helps nobody.
| Dimension | Traditional development | AI-native pipeline |
|---|---|---|
| Time to first working prototype | 4–8 weeks | 5–10 days |
| Code generation throughput | ~400 lines/day | 2,000–5,000 lines/day |
| Test coverage on new code | 60–75% | 88–96% |
| Documentation completeness | 40–60% | 85–92% |
| Cost per feature (small) | $800–$2,000 | $200–$600 |
| Architectural decision quality | High (with senior devs) | Medium (human oversight needed) |
| Novel problem-solving | High | Medium (pattern-dependent) |
| Security audit depth | High (with specialists) | Medium (known patterns only) |
| Long-term maintenance cost | Variable | Lower (consistent patterns) |
| Deployment reliability | 94–98% | 98.5–99.5% |
Traditional development figures based on industry surveys (Stack Overflow Developer Survey 2025, GitLab DevSecOps Report 2025) and our own experience running hybrid projects. AI-native figures are our internal measurements. Both assume competent practitioners.