Agents' Last Exam, FlashMemory for ultra-long context, and why Muon beats Adam

Three pointed papers do the day's heavy lifting. Agents' Last Exam asks why benchmark wins don't translate to real outcomes and proposes harder, broader tests. FlashMemory-DeepSeek-V4 makes ultra-long-context inference workable with lookahead sparse attention, attacking the KV-cache bottleneck that the silicon market repriced last week. Why Muon Outperforms Adam delivers a curvature-perspective explanation for Muon's ~2x training-efficiency gain. SpatialWorld and Skill-3D push interactive spatial benchmarks toward real-world tasks, and Echo-Memory tests memory mechanisms inside action world models. The press cycle is quiet.

Papers

Agents saturate, then harden

Agents' Last Exam names the gap between benchmark gains and real-world outcomes; SWE-Explore probes how coding agents actually navigate large repos; SlimSearcher cuts the cost of deep research agents with adaptive reward gating; DuMate-DeepResearch builds an auditable rubric-grounded multi-agent system.

Paper · Hugging Face

Agents' Last Exam

Position-and-benchmark paper arguing strong scores on existing agent suites don't translate to real outcomes — proposes harder, broader evaluation.

Why it matters

Names the leaderboard-vs-deployment gap that two weeks of agent benchmarks have been edging toward.
Likely the next reference benchmark vendors will be graded against.
Aligns with RAMP and TASTE — three converging arguments that current agent benchmarks are spent.

Paper · Hugging Face

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Benchmarks the exploration behavior of coding agents in real repositories, going beyond SWE-bench-style single-task scoring.

agents code evaluation

Paper · Hugging Face

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Trains deep-research web agents with adaptive reward gating — cuts the compute cost of strong agentic search.

agents reinforcement-learning rag

Paper · Hugging Face

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Multi-agent deep-research system with recursive search and rubric-grounded reasoning — auditable by design.

agents rag evaluation

Long-context inference and training-theory wins

FlashMemory-DeepSeek-V4 ships lookahead sparse attention for ultra-long context; End-to-End Context Compression at Scale attacks KV-cache growth structurally; Why Muon Outperforms Adam gives a curvature explanation for the optimizer that's been quietly displacing Adam in pretraining.
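
To make the KV-cache growth mentioned above concrete, here is a back-of-the-envelope sketch in Python. It uses the standard accounting for a dense decoder-only transformer: one key vector and one value vector cached per token, per layer, per KV head. The function name `kv_cache_bytes` and every configuration value are hypothetical; none of them is taken from FlashMemory-DeepSeek-V4 or any other paper in this digest.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (dense attention)."""
    # Factor of 2: one key tensor plus one value tensor in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model dimensions, chosen only to give round numbers (16-bit values).
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(tokens, **CONFIG) / 2**30
    print(f"{tokens:>9} tokens -> {gib:.1f} GiB")

# Prints:
#      8192 tokens -> 1.0 GiB
#    131072 tokens -> 16.0 GiB
#   1048576 tokens -> 128.0 GiB
```

Under these assumed dimensions the cache costs 128 KiB per token, so memory grows linearly with context length. That linear term is the cost that sparse attention and learned context compression try to avoid paying in full.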

Paper · Hugging Face

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Lookahead sparse attention plus a lightning index let DeepSeek-V4 serve ultra-long contexts without holding the full KV cache.

Why it matters

Directly targets the memory bottleneck that XCENA, Groq, and TSMC's supply ceiling have all priced in.
Open-weights frontier work continues to drive practical long-context capability.

long-context attention inference open-weights

Paper · Hugging Face

End-to-End Context Compression at Scale

End-to-end learned context compression for long-context inference — addresses KV-cache growth without per-token surgery.

long-context inference

Paper · Hugging Face

Why Muon Outperforms Adam: A Curvature Perspective

Explains Muon's ~2x training-efficiency gain over Adam via a local-curvature analysis — clean theoretical backing for an optimizer already in widespread use.

training interpretability

Spatial benchmarks and action world models

SpatialWorld pushes interactive spatial reasoning into real-world tasks; Skill-3D evolves agentic 3D skills; Latent Spatial Memory and Echo-Memory study memory in action world models; AHA-WAM proposes asynchronous horizon-adaptive world-action modeling.

Paper · Hugging Face

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial-reasoning benchmark grounded in real-world interactive tasks rather than synthetic 3D puzzles.

vision-language evaluation benchmarks

Paper · Hugging Face

Echo-Memory: A Controlled Study of Memory in Action World Models

Controlled study of how memory mechanisms shape action-conditioned world models.

world-models memory

Paper · Hugging Face

Latent Spatial Memory for Video World Models

Latent spatial memory replaces explicit point clouds for 3D-consistent video world models.

world-models video-generation memory

Also today

Paper · Hugging Face · Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short. Trace tournaments as an alternative when RLVR signals are absent; useful in non-verifiable domains.
Paper · Hugging Face · Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses. Posterior-guided agent skill evolution driven by Bayesian updates over harness feedback.
Paper · Hugging Face · Whisper Hallucination Detection and Mitigation via Hidden Representation Steering. Detects and mitigates Whisper ASR hallucinations via hidden-representation steering and sparse autoencoders; practical for any team running Whisper in production.
Paper · Hugging Face · Text-to-Image Models Need Less from Text Encoders Than You Think. Shows that T2I models extract less from text encoders than commonly assumed, with implications for cheaper conditioning.
Paper · Hugging Face · A Geometric Account of Activation Steering through Angle-Norm Decomposition. Angle-norm decomposition gives a clean geometric account of linear activation steering: why it works and where it fails.
Paper · Hugging Face · LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents. Internalizes textual agent skills as latent in-weight skills: fewer tokens, fewer prompt edits.
Paper · Hugging Face · CoVEBench: Can Video Editing Models Handle Complex Instructions? Tests text-guided video editing on complex, multi-step instructions beyond style transfer and single object insertion.
Paper · Hugging Face · OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents. Unified UE5 game benchmark for VLM agents with improvement dynamics.
Paper · Hugging Face · Human Psychometric Questionnaires Mischaracterize LLM Behavior. Argues that human psychometric questionnaires are poor proxies for predicting LLM behavior; current 'AI personality' tests miss the mark.
Paper · Hugging Face · Answer Presence Drives RAG Rewriting Gains. Shows that RAG rewriter gains are mostly explained by answer presence, a useful diagnostic for RAG pipelines.
Paper · Hugging Face · AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling. Asynchronous, horizon-adaptive world-action models with observation-guided context routing for manipulation.