Sarmadi AI Digest May 21, 2026 Updated 7:00 AM CT Today Archive Topics Saved Subscribe RSS

Long-horizon agent oversight and the KV-cache squeeze

Another research-led day, and the strongest signal is about oversight. SpecBench measures reward hacking in long-horizon coding agents, SaaSBench probes enterprise-scale coding work, and MINTEval tests whether agent memory survives multi-target interference — together a sober counterweight to the agent-productivity story. Inference efficiency drew a cluster of KV-cache work (OScaR, OCTOPUS, Mix-Quant) as long-context and agentic workloads make cache memory the binding constraint. The RLVR literature kept maturing toward theory, with results on its unlearnability limits and the conditional equivalence of DPO and RLHF. Stability AI also shipped Stable Audio 3. The press cycle stayed quiet over the weekend run-up.

15 papers 0 news 1 sources ← Latest

Papers

13 items

Oversight for long-horizon agents

As coding and operations agents produce more output than humans can review, oversight collapses onto the test suite — and the test suite can be gamed. New benchmarks measure reward hacking, enterprise-scale coding limits, planning, and memory interference, sharpening the gap between demo and deployable.

Paper Hugging Face

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Benchmark for reward hacking when long-horizon coding agents produce more code than anyone can review and oversight collapses onto the automated test suite.

Why it matters
  • Names and measures the exact failure mode that scares teams adopting autonomous coding agents.
  • Test-suite-as-sole-oversight is the default in CI; this shows where it breaks.
  • Direct procurement-grade signal for anyone evaluating agentic coding vendors.

The KV-cache squeeze

Long-context and agentic inference have made KV-cache memory the dominant cost. Three papers attack it from different angles: extreme cache quantization, octahedral-parametrized codecs, and quantized prefill with precise decoding tuned for agent workloads.

RLVR and alignment theory mature

The post-training literature is turning theoretical: results on RLVR's unlearnability limits, the conditional equivalence of DPO and RLHF, and minimal rank-1 RLVR training all try to explain why these methods work and where they fail, rather than just reporting wins.

Audio foundation models step up

Stability AI shipped Stable Audio 3, a fast latent-diffusion family for variable-length audio generation and editing, alongside a real-world ASR scaling effort and a survey mapping the large-audio-model landscape.

Also today