Benchmark Results
CodeTether is benchmarked using Ralph's autonomous PRD loop against real coding tasks. Every story passes through four quality gates. No synthetic puzzles, no cherry-picking.
Methodology
Real PRDs
Each benchmark is a PRD with user stories, acceptance criteria, and dependency graphs. Not isolated function completions — full features with multi-file changes.
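As a concrete illustration, a PRD file might look like the sketch below. The field names (`stories`, `acceptance_criteria`, `depends_on`, `priority`) are illustrative assumptions, not the actual schema:

```json
{
  "name": "Simple REST API",
  "stories": [
    {
      "id": "story-1",
      "title": "Health check endpoint",
      "acceptance_criteria": ["GET /health returns 200"],
      "depends_on": [],
      "priority": 1
    },
    {
      "id": "story-2",
      "title": "Greeting endpoint",
      "acceptance_criteria": ["GET /greet/{name} returns a greeting"],
      "depends_on": ["story-1"],
      "priority": 2
    }
  ]
}
```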
Quality Gates
Every story must pass `cargo check`, `cargo clippy`, `cargo test`, and `cargo build`. No shortcuts. No partial credit.
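A story counts as passed only when all four gates succeed in sequence. A minimal sketch of that gating logic, assuming a runner that shells out to cargo (the exact clippy flags are an assumption):

```python
import subprocess

GATES = [
    ["cargo", "check"],
    ["cargo", "clippy", "--", "-D", "warnings"],  # flag choice is an assumption
    ["cargo", "test"],
    ["cargo", "build"],
]

def run_gates(cwd=".", runner=subprocess.run):
    """Return True only if every gate exits 0; stop at the first failure."""
    for cmd in GATES:
        result = runner(cmd, cwd=cwd)
        if result.returncode != 0:
            return False  # no partial credit
    return True
```

Injecting the `runner` makes the logic testable without a real Rust workspace.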
Real Costs
Token counts from actual API calls. Cost computed using real-time pricing from models.dev. No estimates or projections.
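The cost arithmetic itself is simple: token counts multiplied by per-token prices. A sketch, where the prices are illustrative placeholders rather than actual models.dev quotes:

```python
def story_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost in dollars, given token counts and $/million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Made-up prices: $3/M input tokens, $15/M output tokens
cost = story_cost(20_000, 5_000, 3.0, 15.0)  # 0.135 dollars
```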
How It Works
1. Load a benchmark PRD defining stories with acceptance criteria.
2. Ralph's autonomous loop picks the next story (dependency-aware, priority-sorted).
3. The LLM implements the story using tools: edit, bash, search, glob, grep.
4. The four quality gates run. Pass: the story is marked complete. Fail: retry, up to the max iteration limit.
5. Results are recorded: pass/fail, duration, tokens, cost, files changed.
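The steps above can be sketched as a loop. Story selection and retry logic are simplified, and all names here are illustrative, not CodeTether's actual internals:

```python
def next_story(stories, done):
    """Pick the ready story with the best priority (lower value = higher priority,
    an assumption), where 'ready' means all dependencies are already done."""
    ready = [s for s in stories
             if s["id"] not in done and all(d in done for d in s["depends_on"])]
    return min(ready, key=lambda s: s["priority"]) if ready else None

def run_loop(stories, implement, gates_pass, max_iterations=3):
    done, results = set(), []
    while (story := next_story(stories, done)) is not None:
        passed = False
        for attempt in range(1, max_iterations + 1):
            implement(story)        # LLM edits files via its tools
            if gates_pass():        # check / clippy / test / build
                passed = True
                break
        results.append({"id": story["id"], "passed": passed, "attempts": attempt})
        done.add(story["id"])       # move on either way, recording pass/fail
    return results
```

Marking a story done even after exhausting retries keeps the loop from stalling while still recording the failure.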
Model Comparison
Benchmark Suite
8 PRDs across 3 tiers — 34 total stories
- Simple REST API (`t1-rest-api.json`): Health check and greeting endpoints with axum
- CLI Calculator (`t1-cli-tool.json`): Four-operation calculator with clap, division-by-zero handling
- JSON Config Parser (`t1-json-parser.json`): Config parsing with defaults, validation, error types
- Todo CRUD API (`t2-todo-api.json`): Full CRUD with validation, pagination, thread-safe store
- CSV to JSON (`t2-file-processor.json`): CSV parser, JSON writer, CLI interface with stats
- Async Task Queue (`t2-state-machine.json`): State machine, concurrent queue, cancellation, metrics
- Order Microservice (`t3-microservice.json`): Event-driven orders: state machine, REST, events, search, middleware
- Plugin System (`t3-plugin-system.json`): Dynamic plugins: registry, pipeline, lifecycle, events, config
Cost Efficiency
Based on real benchmark token usage and live pricing from models.dev
| Model | Provider | Pass Rate | $/Story | Tokens/Story | Speed | $/Passed Story |
|---|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 100% | $0.19 | 25k | 40.7/hr | $0.19 |
| Claude Sonnet 4 | Anthropic | 100% | $0.84 | 30k | 32.1/hr | $0.84 |
| DeepSeek R1 | DeepSeek | 90% | $0.22 | 28k | 35.2/hr | $0.24 |
| GPT-4.1 | OpenAI | 95% | $1.12 | 35k | 28.5/hr | $1.18 |
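The last column is the per-story cost divided by the pass rate: the expected spend per successfully completed story, since failed runs still consume tokens. The table's numbers check out:

```python
def cost_per_passed(cost_per_story, pass_rate):
    """Expected dollars per passing story; failed attempts still cost money."""
    return cost_per_story / pass_rate

# DeepSeek R1 from the table: $0.22/story at a 90% pass rate -> $0.24
assert round(cost_per_passed(0.22, 0.90), 2) == 0.24
# GPT-4.1: $1.12/story at 95% -> $1.18
assert round(cost_per_passed(1.12, 0.95), 2) == 1.18
```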
Run Your Own Benchmarks
The benchmark suite is open source. Run it on your own models, your own hardware, with your own PRDs.
```sh
# Install CodeTether
curl -fsSL https://raw.githubusercontent.com/rileyseaburg/A2A-Server-MCP/main/scripts/install-agent.sh | bash

# Run benchmarks with your model
codetether benchmark \
  --prd-dir benchmarks/ \
  --models anthropic:claude-sonnet-4-20250514 \
  --cost-ceiling 20.0 \
  --output my_results.json
```
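Afterwards you can summarize the output file yourself. A sketch assuming my_results.json holds one record per story with `passed` and `cost_usd` fields; the real schema may differ:

```python
import json

def summarize(path):
    """Compute pass rate and total cost from a list of per-story result records."""
    with open(path) as f:
        results = json.load(f)
    passed = sum(1 for r in results if r["passed"])
    total_cost = sum(r["cost_usd"] for r in results)
    return {"pass_rate": passed / len(results),
            "total_cost_usd": round(total_cost, 2)}
```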