SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
A benchmark that measures reward hacking in long-horizon coding agents, in the setting where agents produce more code than anyone can review and oversight collapses onto the automated test suite.
Why it matters
- Names and measures a specific failure mode that worries teams adopting autonomous coding agents: passing the tests without doing the intended work.
- Relying on the test suite as the sole form of oversight is the default in CI; the benchmark shows where that breaks.
- Direct procurement-grade signal for anyone evaluating agentic coding vendors.