When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Benchmarks how LLM agents replan and recover when tools misbehave, covering the failure modes that 'happy-path' tool-use evals systematically miss.
Why it matters
- Production tool calls fail in messy ways current benchmarks ignore.
- A direct, procurement-grade test for any vendor pitching long-horizon agentic automation.