Labs · Open Source

Open source tools from the Labs team.

We build things for our own work, then release them. No strings. MIT or Apache licensed. These tools run in our own production systems before anyone else touches them.

6 repos·1,529+ total stars·4 languages·3 licenses

agent-eval

TOOLINGACTIVE

Evaluation harness for multi-agent systems

Standardized benchmarks for testing multi-agent coordination, tool use reliability, and failure recovery. 47 built-in test scenarios covering common failure modes in production agent systems.

prompt-guard

TOOLINGACTIVE

Prompt injection detection and sanitization

Detects and blocks prompt injection attempts with <2ms latency. 94% accuracy on our benchmark dataset. Supports custom allowlists and configurable sensitivity thresholds for different deployment contexts.

rag-bench

BENCHMARKACTIVE

Benchmarking suite for RAG pipelines

Compare retrieval strategies, chunking methods, and embedding models on standardized datasets. Used by 200+ teams to evaluate RAG configurations before production deployment.

spec-runner

FRAMEWORKACTIVE

Spec-to-test pipeline

Convert natural language specifications into executable test suites. Integrates with Jest and Vitest. Extracts behavioral requirements from specs and generates property-based and example-based tests.

industry-datasets

DATASETACTIVE

Curated labeled datasets for business document automation

Annotated real-world business documents for training and evaluating automation models. 50K+ examples across invoices, purchase orders, contracts, and insurance forms. CC-BY licensed.

flow-tracer

TOOLINGACTIVE

Observability for multi-step AI workflows

Trace, visualize, and debug complex agent pipelines. OpenTelemetry compatible. Captures tool call inputs and outputs, latency at each step, and token usage per agent. Works with any orchestration framework.

Contributing

Contributions
are welcome.

Every project has a CONTRIBUTING.md with setup instructions and contribution guidelines. We review pull requests from the community and typically respond within 3 business days.

Contribution guide

Fork the repository and create a branch from main. Name it descriptively.

Run the test suite locally before making changes. All contributions must pass existing tests and include tests for new behavior.

Open a pull request with a clear description of what you changed and why. Reference any related issues.

Address review feedback. Our team reviews all PRs and may request changes or clarification before merging.

Why we open source

Tools we built
for ourselves first.

Every project in this catalog started as internal tooling. We built agent-eval because we needed a standardized way to measure tool use reliability across client systems. We built prompt-guard because we were handling prompt injection incidents manually and it was not sustainable. We built flow-tracer because debugging production agent pipelines without observability was costing us hours per incident.

"These tools run in our own production systems first. We release them when they are ready for others — when the rough edges are smoothed out and the documentation is good enough that someone outside our team can use them without help."

We release under permissive licenses (MIT, Apache 2.0, CC-BY 4.0) because we believe the tooling ecosystem for AI engineering is still early and fragmented. Keeping useful tools proprietary slows the whole field down. There is nothing in these repos that is a competitive differentiator for us — our advantage is in how we apply these tools, not in the tools themselves.

We maintain these repositories because we still use them. If we stop using a tool internally, we will archive the repository and say so clearly. We will not let projects rot silently.

Open source tools from the Labs team.

Contributionsare welcome.

Tools we builtfor ourselves first.

Contributions
are welcome.

Tools we built
for ourselves first.