Agent reliability under noise; open-world self-evolution

A research-led Monday after a quiet weekend. The agent-reliability thread sharpened: When Tools Fail benchmarks dynamic replanning and anomaly recovery, ResearchClawBench evaluates end-to-end autonomous research, SubtleMemory tests fine-grained relational memory in long-horizon agents, and OpenSkill proposes open-world self-evolution for LLM agents that does not assume a curated learning loop. Spatial reasoning came back into focus with Stream3D-VLM and Imaginative Perception Tokens. Robots Need More than VLA and World Models argues against the view that policy scaling alone is enough for generalist robot intelligence. No press cycle on the day.

14 papers · 0 news · 1 source

Papers

9 items

Agent reliability under noise

Four papers tighten what production-grade agent reliability means. Three are benchmarks, covering dynamic replanning when tools fail, end-to-end autonomous scientific research, and fine-grained relational memory over long horizons. The fourth proposes open-world self-evolution where the learning loop isn't guaranteed.

Paper Hugging Face

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Benchmarks how LLM agents replan and recover when tools misbehave — covering the failure modes 'happy-path' tool-use evals systematically miss.

Why it matters

Production tool calls fail in messy ways that current benchmarks ignore.
Gives buyers a direct, procurement-grade test for any vendor pitching long-horizon agentic automation.

Paper Hugging Face

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Evaluates AI coding agents on end-to-end autonomous scientific research, not on isolated subtasks.

agents code evaluation benchmarks

Paper Hugging Face

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Tests whether long-horizon agents can keep fine-grained relational distinctions in memory across accumulating interactions.

agents memory evaluation

Paper Hugging Face

OpenSkill: Open-World Self-Evolution for LLM Agents

Self-evolving agents that adapt after deployment without the curated learning loops most prior approaches assume.

agents reinforcement-learning

Spatial reasoning pushes back

Stream3D-VLM brings 3D understanding online with incremental geometry priors, Imaginative Perception Tokens enhance VLM spatial reasoning, and Thinking with Imagination uses world simulators for agentic visual spatial reasoning. The critique from earlier in the week, that VLMs don't actually understand spatial relations, is being answered.

Paper Hugging Face

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Brings 3D scene understanding online for 3D multimodal models, using incremental geometry priors instead of requiring the whole scene up front.

vision-language multimodal robotics

Paper Hugging Face

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Imaginative perception tokens give VLMs a structured intermediate state for spatial reasoning when the relevant information is not visible.

vision-language reasoning

Paper Hugging Face

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Couples VLMs with world simulators to extend their visual spatial reasoning beyond observation.

vision-language world-models agents

Robots need more than VLA + world models

A position paper argues generalist robot intelligence is not just a policy-scaling problem; AnchorWorld extends interactive world modeling with view-based evolution; Direct 3D-Aware Object Insertion ships a clean compositing tool.

Paper Hugging Face

Robots Need More than VLA and World Models

Argues generalist robot intelligence requires more than scaling vision-language-action and world models — task structure, planning, and abstraction matter.

robotics agents

Paper Hugging Face

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Embodied egocentric world simulation that supports customizable view-based evolution.

world-models robotics

Also today

Paper · Hugging Face Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills — Coding agents that self-evolve by extracting reusable skills from their own execution traces.
Paper · Hugging Face dots.tts Technical Report — Describes dots.tts, a 2B continuous autoregressive text-to-speech foundation model.
Paper · Hugging Face Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings — Shows that an LLM's unembedding matrix functions as a feature lens for downstream text embeddings.
Paper · Hugging Face LLM Explainability with Counterfactual Chains and Causal Graphs — Uses counterfactual chains and causal graphs to make LLM mechanisms transparent.
Paper · Hugging Face UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs — Benchmarks whether LLMs can capture true distributional randomness — relevant for agents that simulate uncertainty.
Paper · Hugging Face SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains — Reliable evaluation of proactive LLM mediation across domains and socio-cognitive variations.
Paper · Hugging Face GENEB: Why Genomic Models Are Hard to Compare — Catalogs why genomic foundation models are hard to benchmark — fragmented eval protocols and incompatible test sets.