Define custom evals
Track performance of your agents over time
BuildBench is the modern TypeScript framework for building, benchmarking and shipping AI agents — locally first, in production always.
Agents. Workflows. RAG. Memory. Tools. MCP. BuildBench lets you go from idea to a passing benchmark.
Define agents and workflows in TypeScript. Iterate against the model providers and tools you already use — and watch the bench score climb.
Visualise traces, run evals and tune prompts in a studio that ships with the framework.
Tune context. Improve recall. Tweak until your agents achieve human‑level accuracy.
View traces and logs for your agents
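As a rough illustration of what a custom eval can look like, here is a minimal, framework-agnostic sketch in plain TypeScript. It does not use BuildBench's API. Every name in it (EvalCase, Agent, exactMatch, meanScore, runEval) is a hypothetical stand-in, and exact-match scoring is only one simple choice of metric.

```typescript
// Hypothetical types for illustration only; not BuildBench's actual API.
interface EvalCase {
  input: string;
  expected: string;
}

// An agent is modelled here as any async function from prompt to answer.
type Agent = (input: string) => Promise<string>;

// A scorer maps (actual output, expected output) to a score in [0, 1].
type Scorer = (output: string, expected: string) => number;

// Collapse whitespace and case so trivial formatting differences
// do not count as failures.
function normalise(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Exact-match scorer: 1 if the normalised strings are equal, otherwise 0.
const exactMatch: Scorer = (output, expected) =>
  normalise(output) === normalise(expected) ? 1 : 0;

// Arithmetic mean of per-case scores; an empty suite scores 0.
function meanScore(scores: number[]): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Run each case through the agent in order, score it, and return the mean.
async function runEval(
  agent: Agent,
  cases: EvalCase[],
  scorer: Scorer = exactMatch,
): Promise<number> {
  const scores: number[] = [];
  for (const c of cases) {
    const output = await agent(c.input);
    scores.push(scorer(output, c.expected));
  }
  return meanScore(scores);
}
```

Recording the number returned by runEval alongside a timestamp or commit hash on each run is one simple way to follow a score over time.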
Expose your agents as APIs, or bundle them with your app. With BuildBench, your agents are part of your infrastructure.
Control your source code and infrastructure
Deploy BuildBench agents wherever you’re hosting your app, or as a standalone service
with Templates · read the Bench Book · learn with the Tutorial · watch the Workshop · tune in to Bench Hour
An agent that controls a real browser to research and act on the open web — scored against a click-trace harness.
Template
Hand a spreadsheet to an agent that summarises trends and exports a tidy report — measured for table-faithfulness.
Template
A natural-language SQL agent over your warehouse with grounded answers, citations and an automatic recall bench.
A field-tested guide for engineers shipping production agents — and the suites that grade them.
Read the book →
Learn to wire up agents, tools and MCP servers, and grade them against a real benchmark, in a hands-on CLI course.
Start the course →