Long-horizon agent oversight and the KV-cache squeeze

Another research-led day, and the strongest signal is about oversight. SpecBench measures reward hacking in long-horizon coding agents, SaaSBench probes enterprise-scale coding work, and MINTEval tests whether agent memory survives multi-target interference — together a sober counterweight to the agent-productivity story. Inference efficiency drew a cluster of KV-cache work (OScaR, OCTOPUS, Mix-Quant) as long-context and agentic workloads make cache memory the binding constraint. The RLVR literature kept maturing toward theory, with results on its unlearnability limits and the conditional equivalence of DPO and RLHF. Stability AI also shipped Stable Audio 3. The press cycle stayed quiet over the weekend run-up.

15 papers 0 news 1 sources ← Latest

Papers

13 items

Oversight for long-horizon agents

As coding and operations agents produce more output than humans can review, oversight collapses onto the test suite — and the test suite can be gamed. New benchmarks measure reward hacking, enterprise-scale coding limits, planning, and memory interference, sharpening the gap between demo and deployable.

Paper Hugging Face

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Benchmark for reward hacking when long-horizon coding agents produce more code than anyone can review and oversight collapses onto the automated test suite.

Why it matters

Names and measures the exact failure mode that scares teams adopting autonomous coding agents.
Test-suite-as-sole-oversight is the default in CI; this shows where it breaks.
Direct procurement-grade signal for anyone evaluating agentic coding vendors.

Source → Arc

Paper Hugging Face

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Benchmark for end-to-end enterprise SaaS engineering tasks — exposes where coding agents stall on realistic long-horizon work.

agents code evaluation benchmarks

Source → Arc

Paper Hugging Face

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Tests whether agent memory stays accurate when information is repeatedly updated and interferes across stored memories.

agents memory evaluation

Source → Arc

Paper Hugging Face

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training LLMs

Generates scalable, verifiable planning problems for both evaluating and training the goal/constraint coordination LLMs need for agentic work.

agents evaluation benchmarks reasoning

Source → Arc

Paper Hugging Face

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

GUI-agent benchmark for professional media post-production — pushes agents past web navigation into creative-tool workflows.

agents vision-language benchmarks

Source → Arc

The KV-cache squeeze

Long-context and agentic inference have made KV-cache memory the dominant cost. Three papers attack it from different angles: extreme cache quantization, octahedral-parametrized codecs, and quantized prefill with precise decoding tuned for agent workloads.

Paper Hugging Face

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Extreme KV-cache quantization that targets the memory footprint dominating long-context and multimodal inference.

Why it matters

KV cache is now the binding constraint on serving long-context agents cheaply.
Aggressive quantization here directly lowers the cost floor for SMB-scale deployments.

inference quantization long-context

Source → Arc

Paper Hugging Face

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Quantizes the prefill phase while keeping decode precise — tuned for the planning/tool-use/memory pattern of agentic LLMs.

inference quantization agents

Source → Arc

Paper Hugging Face

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

I/O-aware expert offload makes mixture-of-experts diffusion LLM inference efficient without quality loss.

inference mixture-of-experts diffusion

Source → Arc

RLVR and alignment theory mature

The post-training literature is turning theoretical: results on RLVR's unlearnability limits, the conditional equivalence of DPO and RLHF, and minimal rank-1 RLVR training all try to explain why these methods work and where they fail, rather than just reporting wins.

Paper Hugging Face

The Unlearnability Phenomenon in RLVR for Language Models

Identifies cases where RLVR cannot teach a behavior at all — a limit on what verifiable-reward post-training can achieve.

reinforcement-learning reasoning

Source → Arc

Paper Hugging Face

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Shows when DPO and RLHF are equivalent, where the equivalence breaks, and what that implies for provable alignment.

alignment rlhf fine-tuning

Source → Arc

Paper Hugging Face

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Argues most RLVR gains can be reached with minimal, rank-1-trajectory training — echoing the sparse-policy-selection thesis.

reinforcement-learning fine-tuning

Source → Arc

Audio foundation models step up

Stability AI shipped Stable Audio 3, a fast latent-diffusion family for variable-length audio generation and editing, alongside a real-world ASR scaling effort and a survey mapping the large-audio-model landscape.

Paper Hugging Face

Stable Audio 3

A family of fast latent-diffusion models (small/medium/large) for variable-length audio generation and editing.

audio diffusion open-weights

Source → Arc

Paper Hugging Face

Mega-ASR: Towards In-the-wild² Speech Recognition via Scaling up Real-world Acoustic Simulation

Scales real-world acoustic simulation to improve robust speech recognition in difficult environments.

asr speech audio

Source → Arc

Also today

Paper · Hugging Face Toto 2.0: Time Series Forecasting Enters the Scaling Era — Time-series foundation models scale cleanly from 4M to 2.5B parameters with one training recipe.
Paper · Hugging Face Mem-π: Adaptive Memory through Learning When and What to Generate — Agent memory that generates guidance on demand rather than retrieving from a fixed store.
Paper · Hugging Face A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook — Survey mapping the large-audio-language-model landscape across generalization and trustworthiness.
Paper · Hugging Face It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs — Treats privacy as contextual integrity — governing information flow by norms — and trains for it via complementary self-distillation.
Paper · Hugging Face Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection — Frames safety post-training as continual learning, using orthogonal gradient projection to limit the utility loss from alignment.
Paper · Hugging Face On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists — 45 expert scientists assess AI peer reviewers against Nature-family reviews — a measured read on where AI review helps and fails.
Paper · Hugging Face Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining — Synthesizes GUI interaction trajectories from video to pretrain more generalizable GUI agents.
Paper · Hugging Face Generative Recursive Reasoning — Recursive Reasoning Models as an alternative to autoregressive extended computation.
Paper · Hugging Face IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools — Agentic tool use for open-vocabulary industrial anomaly detection.
Paper · Hugging Face PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis — Generates consistent whole-house VR tours from a floorplan and style reference with cross-view spatial coherence.