Benchmark Results
CodeTether is benchmarked using Ralph's autonomous PRD loop against real coding tasks. Every story passes through four quality gates. No synthetic puzzles, no cherry-picking.
Methodology
Real PRDs
Each benchmark is a PRD with user stories, acceptance criteria, and dependency graphs. Not isolated function completions — full features with multi-file changes.
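As a concrete illustration, a PRD file might look like the sketch below. The field names (`stories`, `acceptance_criteria`, `depends_on`, `priority`) are illustrative assumptions, not the actual schema:

```json
{
  "name": "Simple REST API",
  "stories": [
    {
      "id": "story-1",
      "title": "Health check endpoint",
      "acceptance_criteria": ["GET /health returns 200"],
      "depends_on": [],
      "priority": 1
    },
    {
      "id": "story-2",
      "title": "Greeting endpoint",
      "acceptance_criteria": ["GET /greet/{name} returns a greeting"],
      "depends_on": ["story-1"],
      "priority": 2
    }
  ]
}
```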
Quality Gates
Every story must pass `cargo check`, `cargo clippy`, `cargo test`, and `cargo build`. No shortcuts. No partial credit.
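A story counts as passed only when all four gates succeed in sequence. A minimal sketch of that gating logic, assuming a runner that shells out to cargo (the exact clippy flags are an assumption):

```python
import subprocess

GATES = [
    ["cargo", "check"],
    ["cargo", "clippy", "--", "-D", "warnings"],  # flag choice is an assumption
    ["cargo", "test"],
    ["cargo", "build"],
]

def run_gates(cwd=".", runner=subprocess.run):
    """Return True only if every gate exits 0; stop at the first failure."""
    for cmd in GATES:
        result = runner(cmd, cwd=cwd)
        if result.returncode != 0:
            return False  # no partial credit
    return True
```

Injecting the `runner` makes the logic testable without a real Rust workspace.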
Real Costs
Token counts from actual API calls. Cost computed using real-time pricing from models.dev. No estimates or projections.
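The cost arithmetic itself is simple: token counts multiplied by per-token prices. A sketch, where the prices are illustrative placeholders rather than actual models.dev quotes:

```python
def story_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost in dollars, given token counts and $/million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Made-up prices: $3/M input tokens, $15/M output tokens
cost = story_cost(20_000, 5_000, 3.0, 15.0)  # 0.135 dollars
```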
How It Works
1. Load a benchmark PRD defining stories with acceptance criteria.
2. Ralph's autonomous loop picks the next story (dependency-aware, priority-sorted).
3. The LLM implements the story using tools: edit, bash, search, glob, grep.
4. The four quality gates run. Pass: the story is marked complete. Fail: retry, up to the max iteration limit.
5. Results are recorded: pass/fail, duration, tokens, cost, files changed.
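The steps above can be sketched as a loop. Story selection and retry logic are simplified, and all names here are illustrative, not CodeTether's actual internals:

```python
def next_story(stories, done):
    """Pick the ready story with the best priority (lower value = higher priority,
    an assumption), where 'ready' means all dependencies are already done."""
    ready = [s for s in stories
             if s["id"] not in done and all(d in done for d in s["depends_on"])]
    return min(ready, key=lambda s: s["priority"]) if ready else None

def run_loop(stories, implement, gates_pass, max_iterations=3):
    done, results = set(), []
    while (story := next_story(stories, done)) is not None:
        passed = False
        for attempt in range(1, max_iterations + 1):
            implement(story)        # LLM edits files via its tools
            if gates_pass():        # check / clippy / test / build
                passed = True
                break
        results.append({"id": story["id"], "passed": passed, "attempts": attempt})
        done.add(story["id"])       # move on either way, recording pass/fail
    return results
```

Marking a story done even after exhausting retries keeps the loop from stalling while still recording the failure.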
Model Comparison
Benchmark Suite
8 PRDs across 3 tiers — 34 total stories
- Simple REST API (`t1-rest-api.json`): Health check and greeting endpoints with axum
- CLI Calculator (`t1-cli-tool.json`): Four-operation calculator with clap, division-by-zero handling
- JSON Config Parser (`t1-json-parser.json`): Config parsing with defaults, validation, error types
- Todo CRUD API (`t2-todo-api.json`): Full CRUD with validation, pagination, thread-safe store
- CSV to JSON (`t2-file-processor.json`): CSV parser, JSON writer, CLI interface with stats
- Async Task Queue (`t2-state-machine.json`): State machine, concurrent queue, cancellation, metrics
- Order Microservice (`t3-microservice.json`): Event-driven orders: state machine, REST, events, search, middleware
- Plugin System (`t3-plugin-system.json`): Dynamic plugins: registry, pipeline, lifecycle, events, config
Cost Efficiency
Based on real benchmark token usage and live pricing from models.dev
| Model | Provider | Pass Rate | $/Story | Tokens/Story | Speed | $/Passed Story |
|---|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 100% | $0.19 | 25k | 40.7/hr | $0.19 |
| Claude Sonnet 4 | Anthropic | 100% | $0.84 | 30k | 32.1/hr | $0.84 |
| DeepSeek R1 | DeepSeek | 90% | $0.22 | 28k | 35.2/hr | $0.24 |
| GPT-4.1 | OpenAI | 95% | $1.12 | 35k | 28.5/hr | $1.18 |
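The last column is the per-story cost divided by the pass rate: the expected spend per successfully completed story, since failed runs still consume tokens. The table's numbers check out:

```python
def cost_per_passed(cost_per_story, pass_rate):
    """Expected dollars per passing story; failed attempts still cost money."""
    return cost_per_story / pass_rate

# DeepSeek R1 from the table: $0.22/story at a 90% pass rate -> $0.24
assert round(cost_per_passed(0.22, 0.90), 2) == 0.24
# GPT-4.1: $1.12/story at 95% -> $1.18
assert round(cost_per_passed(1.12, 0.95), 2) == 1.18
```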
Run Your Own Benchmarks
The benchmark suite is open source. Run it on your own models, your own hardware, with your own PRDs.
```sh
# Install CodeTether
curl -fsSL https://raw.githubusercontent.com/rileyseaburg/A2A-Server-MCP/main/scripts/install-agent.sh | bash

# Run benchmarks with your model
codetether benchmark \
  --prd-dir benchmarks/ \
  --models anthropic:claude-sonnet-4-20250514 \
  --cost-ceiling 20.0 \
  --output my_results.json
```
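Afterwards you can summarize the output file yourself. A sketch assuming my_results.json holds one record per story with `passed` and `cost_usd` fields; the real schema may differ:

```python
import json

def summarize(path):
    """Compute pass rate and total cost from a list of per-story result records."""
    with open(path) as f:
        results = json.load(f)
    passed = sum(1 for r in results if r["passed"])
    total_cost = sum(r["cost_usd"] for r in results)
    return {"pass_rate": passed / len(results),
            "total_cost_usd": round(total_cost, 2)}
```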