Agents' Last Exam
Position-and-benchmark paper arguing that strong scores on existing agent suites don't translate to real-world outcomes; it proposes a harder, broader evaluation.
Why it matters
- Names the leaderboard-vs-deployment gap that the last two weeks of agent benchmarks have been edging toward.
- Likely to become the next reference benchmark that vendors are graded against.
- Aligns with RAMP and TASTE: together, three converging arguments that current agent benchmarks are spent.