Benchmark Results

CodeTether is benchmarked using Ralph's autonomous PRD loop against real coding tasks. Every story passes through four quality gates. No synthetic puzzles, no cherry-picking.

Methodology

Real PRDs

Each benchmark is a PRD with user stories, acceptance criteria, and dependency graphs. Not isolated function completions — full features with multi-file changes.
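To make the shape concrete, here is a hypothetical PRD sketched as a Python dict. The field names (`id`, `priority`, `depends_on`, `acceptance_criteria`) are illustrative assumptions, not CodeTether's actual schema.

```python
# Illustrative sketch of a benchmark PRD. Field names are assumptions,
# not CodeTether's real on-disk format.
prd = {
    "name": "t1-rest-api",
    "stories": [
        {
            "id": "S1",
            "title": "Health check endpoint",
            "priority": 1,
            "depends_on": [],
            "acceptance_criteria": ["GET /health returns 200"],
        },
        {
            "id": "S2",
            "title": "Greeting endpoint",
            "priority": 2,
            "depends_on": ["S1"],  # dependency graph: S2 waits on S1
            "acceptance_criteria": ["GET /hello/{name} returns a greeting"],
        },
    ],
}
```

The dependency graph is what forces multi-file, feature-level work: a story only becomes eligible once everything it depends on has shipped.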

Quality Gates

Every story must pass cargo check, clippy, test, and build. No shortcuts. No partial credit.
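The all-or-nothing rule can be encoded in a few lines; this is a sketch, not CodeTether's internals, and the `run` callback (command → did it exit 0) is an assumed interface.

```python
# The four quality gates, modeled as commands that must ALL succeed.
# One failing gate fails the story: no partial credit.
GATES = ["cargo check", "cargo clippy", "cargo test", "cargo build"]

def story_passes(run):
    """`run` maps a shell command to True/False (exit status 0 or not)."""
    return all(run(cmd) for cmd in GATES)
```

With a real runner, `run` would shell out, e.g. `subprocess.run(cmd, shell=True).returncode == 0`.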

Real Costs

Token counts from actual API calls. Cost computed using real-time pricing from models.dev. No estimates or projections.

How It Works

  1. Load a benchmark PRD defining stories with acceptance criteria
  2. Ralph's autonomous loop picks the next story (dependency-aware, priority-sorted)
  3. The LLM implements the story using tools: edit, bash, search, glob, grep
  4. Four quality gates run. Pass = story marked complete. Fail = retry (up to max iterations)
  5. Results recorded: pass/fail, duration, tokens, cost, files changed
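The steps above can be sketched as a loop. This is a simplified model, not Ralph's actual implementation: `implement` and `gates_pass` are stubs standing in for the LLM tool calls and the four cargo gates.

```python
def next_story(stories, done):
    """Dependency-aware, priority-sorted pick: among stories whose
    dependencies are all complete, take the highest-priority one."""
    ready = [s for s in stories
             if s["id"] not in done and all(d in done for d in s["depends_on"])]
    return min(ready, key=lambda s: s["priority"]) if ready else None

def run_benchmark(stories, implement, gates_pass, max_iterations=3):
    done, results = set(), []
    while (story := next_story(stories, done)) is not None:
        passed = False
        for attempt in range(1, max_iterations + 1):
            implement(story)          # LLM edits files via tools
            if gates_pass():          # check + clippy + test + build
                passed = True
                break
        results.append({"id": story["id"], "passed": passed,
                        "attempts": attempt})
        done.add(story["id"])         # record the outcome and move on
    return results
```

Lower `priority` values win here; a real scheduler would also track duration, tokens, and cost per attempt.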

Model Comparison

Benchmark Suite

8 PRDs across 3 tiers — 34 total stories

| Tier | Stories | Benchmark | Description | File |
|------|---------|-----------|-------------|------|
| T1 | 2 | Simple REST API | Health check and greeting endpoints with axum | t1-rest-api.json |
| T1 | 1 | CLI Calculator | Four-operation calculator with clap, division-by-zero handling | t1-cli-tool.json |
| T1 | 1 | JSON Config Parser | Config parsing with defaults, validation, error types | t1-json-parser.json |
| T2 | 4 | Todo CRUD API | Full CRUD with validation, pagination, thread-safe store | t2-todo-api.json |
| T2 | 3 | CSV to JSON | CSV parser, JSON writer, CLI interface with stats | t2-file-processor.json |
| T2 | 4 | Async Task Queue | State machine, concurrent queue, cancellation, metrics | t2-state-machine.json |
| T3 | 10 | Order Microservice | Event-driven orders: state machine, REST, events, search, middleware | t3-microservice.json |
| T3 | 9 | Plugin System | Dynamic plugins: registry, pipeline, lifecycle, events, config | t3-plugin-system.json |

Cost Efficiency

Based on real benchmark token usage and live pricing from models.dev

| Model | Provider | Pass Rate | $/Story | $/Passed Story |
|-------|----------|-----------|---------|----------------|
| Kimi K2.5 | Moonshot AI | 100% | $0.19 | $0.19 |
| Claude Sonnet 4 | Anthropic | 100% | $0.84 | $0.84 |
| DeepSeek R1 | DeepSeek | 90% | $0.22 | $0.24 |
| GPT-4.1 | OpenAI | 95% | $1.12 | $1.18 |
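Cost per passed story is simply cost per story divided by pass rate, which is why the two columns match at 100%. Checking the rounded figures against the non-perfect scorers:

```python
def cost_per_passed(cost_per_story, pass_rate):
    """Amortize total spend over only the stories that passed."""
    return cost_per_story / pass_rate

# Values from the cost table:
assert round(cost_per_passed(0.22, 0.90), 2) == 0.24   # DeepSeek R1
assert round(cost_per_passed(1.12, 0.95), 2) == 1.18   # GPT-4.1
assert cost_per_passed(0.19, 1.00) == 0.19             # Kimi K2.5
```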

Run Your Own Benchmarks

The benchmark suite is open source. Run it on your own models, your own hardware, with your own PRDs.

```bash
# Install CodeTether
curl -fsSL https://raw.githubusercontent.com/rileyseaburg/A2A-Server-MCP/main/scripts/install-agent.sh | bash

# Run benchmarks with your model
codetether benchmark \
  --prd-dir benchmarks/ \
  --models anthropic:claude-sonnet-4-20250514 \
  --cost-ceiling 20.0 \
  --output my_results.json
```