A research-heavy midweek: verifiable agent worlds and smarter RLVR

A paper-dominated day on the research feeds. The clearest throughline is verifiability: OpenComputer and EnvFactory build executable, checkable software worlds for computer-use agents, while a cluster of RLVR papers attacks the crude single-reward signal with learned-reliability process rewards, policy-aware rubrics, and contrastive evidence optimization. Video generation kept pushing past plausibility toward verification, with work on reasoning under verifiable rewards and benchmarks for detecting AI-video artifacts. Evaluation itself is being reframed as a design science rather than a leaderboard. The press cycle was quiet; the signal today is in the methods.

16 papers · 0 news · 1 source

Papers

15 items

Verifiable worlds for computer-use agents

Two frameworks build executable, checkable environments so agent training and evaluation can be grounded in real outcomes rather than synthetic stubs. Paired with new GUI and long-context agent work, the day's agent research is converging on verifiability and reusable skills.

Paper Hugging Face

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

A verifier-grounded framework for constructing checkable software worlds where computer-use agents can be trained and evaluated against real outcomes.

Why it matters

Replaces mock-service sandboxes with verifiable environments that better reflect how agents actually fail in production.
Provides reusable infrastructure smaller teams can train computer-use agents against.
Fits this week's shift from controlled evaluation settings toward realistic ones.
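
To make the idea of a verifier-grounded world concrete, here is a minimal sketch. It is not OpenComputer's actual API: the `FileWorld` class, its `apply` method, the `verify_renamed` checker, and the single supported action are all hypothetical names invented for illustration. The point it tries to show is that success is decided by inspecting the environment's final state rather than by trusting what the agent reports.

```python
from dataclasses import dataclass, field


@dataclass
class FileWorld:
    """Hypothetical toy 'software world': an in-memory file system."""
    files: dict = field(default_factory=lambda: {"draft.txt": "hello"})

    def apply(self, action: tuple) -> None:
        # Only one action is supported in this sketch: ("rename", old, new).
        kind, old, new = action
        if kind == "rename" and old in self.files:
            self.files[new] = self.files.pop(old)


def verify_renamed(world: FileWorld) -> bool:
    """Verifier: checks the resulting state, not the agent's claim."""
    return "final.txt" in world.files and "draft.txt" not in world.files


def run_episode(actions: list) -> float:
    """Run the actions in a fresh world and return a verifier-based reward."""
    world = FileWorld()
    for action in actions:
        world.apply(action)
    # Reward is 1.0 only if the verifier confirms the outcome.
    return 1.0 if verify_renamed(world) else 0.0


good = run_episode([("rename", "draft.txt", "final.txt")])
bad = run_episode([("rename", "missing.txt", "final.txt")])
```

Here `good` is rewarded and `bad` is not, even though both episodes issued a rename: the second one targeted a file that does not exist, so the checked state never changed.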

Paper Hugging Face

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Synthesizes executable environments at scale to unblock the two main bottlenecks in agentic RL: robust execution and reward signal.

agents tool-use reinforcement-learning

Paper Hugging Face

Harnessing LLM Agents with Skill Programs

Derives reusable skill programs from past experience to tackle long-horizon tasks more reliably than re-planning each time.

agents tool-use memory

Paper Hugging Face

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

GUI-agent benchmark using live omni-modal smartphone interaction rather than static screenshots.

agents vision-language benchmarks evaluation

Paper Hugging Face

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Maintains a reusable orientation cache across invocations so long-context agents don't re-explore the same corpora or repos each time.

agents long-context memory
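
As a rough illustration of an orientation cache that survives across invocations, the sketch below persists a "context map" to disk and reuses it on the next call. This is an assumption-laden toy, not PEEK's implementation: `explore`, `get_context_map`, the JSON file layout, and the `stats` counter are hypothetical, and the "exploration" is just grouping paths by top-level directory.

```python
import json
import os
import tempfile


def explore(corpus: dict) -> dict:
    """Stand-in for expensive exploration: group paths by top-level directory."""
    context_map = {}
    for path in corpus:
        top = path.split("/")[0]
        context_map.setdefault(top, []).append(path)
    return context_map


def get_context_map(corpus_id: str, corpus: dict, cache_dir: str, stats: dict) -> dict:
    """Return a cached context map if one exists, otherwise build and store it."""
    cache_path = os.path.join(cache_dir, f"{corpus_id}.json")
    if os.path.exists(cache_path):
        stats["hits"] += 1
        with open(cache_path) as f:
            return json.load(f)
    stats["misses"] += 1
    context_map = explore(corpus)
    with open(cache_path, "w") as f:
        json.dump(context_map, f)
    return context_map


# Hypothetical corpus: file paths mapped to (empty) contents.
corpus = {"src/main.py": "", "src/util.py": "", "docs/readme.md": ""}
stats = {"hits": 0, "misses": 0}
with tempfile.TemporaryDirectory() as cache_dir:
    first = get_context_map("demo-repo", corpus, cache_dir, stats)   # explores
    second = get_context_map("demo-repo", corpus, cache_dir, stats)  # reuses cache
```

A real system would also need some way to invalidate the cache when the corpus changes; this sketch deliberately leaves that out.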

RLVR gets a better reward signal

Several papers attack the same weakness: RLVR's single binary reward gives every token the same credit. Learned-reliability process rewards, policy-aware rubric rewards, and contrastive evidence optimization all try to make the signal reflect which steps actually mattered.
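
The weakness described above can be sketched in a few lines. The example below contrasts outcome-only credit, where one scalar reward is copied to every token, with a redistribution that keeps the same total credit but follows per-token weights. The function names and the weight values are hypothetical; in the papers below those weights would come from a process reward model, a rubric, or contrastive evidence, none of which is modeled here.

```python
def uniform_token_credit(num_tokens: int, outcome_reward: float) -> list:
    """Outcome-only RLVR: every token inherits the same scalar reward."""
    return [outcome_reward] * num_tokens


def weighted_token_credit(weights: list, outcome_reward: float) -> list:
    """Redistribute the same total credit in proportion to per-token weights."""
    total = sum(weights)
    n = len(weights)
    return [outcome_reward * n * w / total for w in weights]


tokens = ["Let", "x", "=", "4", "so", "answer", "is", "8"]
uniform = uniform_token_credit(len(tokens), 1.0)

# Hypothetical importance weights: the two numeric tokens are assumed to matter most.
weights = [1, 1, 1, 4, 1, 1, 1, 6]
weighted = weighted_token_credit(weights, 1.0)
```

Both lists sum to the same total, so the overall reward is unchanged; only its distribution across tokens differs.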

Paper Hugging Face

Process Rewards with Learned Reliability

Process reward models that emit a reliability estimate alongside each step score, so downstream methods can weight feedback by confidence.

Why it matters

Addresses the brittleness of single-score PRMs in reasoning pipelines.
Reliability-aware rewards reduce the noise that has limited process supervision.

reinforcement-learning reasoning fine-tuning
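
One simple way a downstream method could weight feedback by confidence is a reliability-weighted mean of step scores. The sketch below assumes each step yields a `(score, reliability)` pair with both values in [0, 1]; that data layout and the function name are illustrative assumptions, not the paper's formulation.

```python
def reliability_weighted_score(steps: list) -> float:
    """Aggregate (score, reliability) pairs into a reliability-weighted mean."""
    total_weight = sum(rel for _, rel in steps)
    if total_weight == 0:
        return 0.0  # no step is trusted, so return a neutral value
    return sum(score * rel for score, rel in steps) / total_weight


# Hypothetical PRM output: the middle step has a low score and low reliability.
steps = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.6)]
plain_mean = sum(score for score, _ in steps) / len(steps)
weighted = reliability_weighted_score(steps)
```

With these made-up numbers the unreliable low score drags the plain mean down more than it affects the weighted one, which is the behavior a reliability estimate is meant to enable.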

Paper Hugging Face

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Weights rubric-based rewards by what the current policy actually needs to learn, extending RLVR beyond automatically-checkable correctness.

reinforcement-learning fine-tuning
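
To illustrate what "policy-aware" weighting could look like, the sketch below weights each rubric item by p(1 - p), where p is the current policy's pass rate on that item. This heuristic is an assumption chosen for illustration, not the paper's weighting rule: items the policy always passes or always fails get no weight, and items it passes about half the time get the most. The rubric names and pass rates are invented.

```python
def policy_aware_weights(pass_rates: dict) -> dict:
    """Weight rubric items by p * (1 - p), normalized to sum to 1."""
    raw = {name: p * (1.0 - p) for name, p in pass_rates.items()}
    total = sum(raw.values())
    if total == 0:
        # Every item is always passed or always failed: fall back to uniform.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def rubric_reward(passed: dict, weights: dict) -> float:
    """Sum the weights of the rubric items a response satisfied."""
    return sum(weights[name] for name, ok in passed.items() if ok)


# Hypothetical pass rates measured on the current policy's samples.
pass_rates = {"cites_source": 0.5, "polite_tone": 1.0, "correct_units": 0.1}
weights = policy_aware_weights(pass_rates)
reward = rubric_reward(
    {"cites_source": True, "polite_tone": True, "correct_units": False},
    weights,
)
```

In this toy run, `polite_tone` contributes nothing to the reward because the policy already satisfies it every time.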

Paper Hugging Face

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Differentiates token credit within a correct RLVR rollout via contrastive evidence, instead of rewarding every token equally.

reinforcement-learning distillation reasoning

Paper Hugging Face

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Fully open-source RLVR post-training recipe aimed specifically at long-context capability.

reinforcement-learning long-context open-weights

Paper Hugging Face

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Shows that inserting self-generated data before and during RL improves outcomes by shaping data diversity.

reinforcement-learning training data

Video generation pushes past plausibility

Video models are being pulled from 'looks real' toward 'is correct': reasoning under verifiable rewards, controllable generation from sparse intent, and benchmarks that score artifacts and multi-shot coherence rather than single-clip realism.

Paper Hugging Face

Video Models Can Reason with Verifiable Rewards

Applies verifiable-reward training to video diffusion so models optimize for verifiable correctness, not just plausible frames.

video-generation reinforcement-learning reasoning

Paper Hugging Face

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Benchmark for whether multimodal models can detect temporal and structural artifacts in AI-generated video — a prerequisite for trustworthy video QA.

video-generation evaluation benchmarks multimodal

Paper Hugging Face

Aurora: Unified Video Editing with a Tool-Using Agent

Wraps unified diffusion video editing in a tool-using agent loop for more controllable, multi-step edits.

video-generation agents

Evaluation becomes a design science

As models act over time through tools and users, static benchmarks lose meaning. A position paper argues interactive evaluation needs design-science rigor, while ThoughtTrace captures what users actually think (not just say) and Editor's Choice tests abstract editing intent.

Paper Hugging Face

Interactive Evaluation Requires a Design Science

Argues that evaluating agents acting over time through tools and users requires design-science methods, not static leaderboards.

evaluation agents

Paper Hugging Face

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Presented as the first large-scale dataset capturing what users think (not only what they type) during real LLM interactions.

evaluation data

Also today

Paper · Hugging Face AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration — Frames autonomous research as iterative hypothesis-challenge-experiment loops with human collaboration, not one-shot paper generation.
Paper · Hugging Face When Vision Speaks for Sound — Shows video MLLMs often fake audio understanding by leaning on visual cues — a measurement trap for omni-modal claims.
Paper · Hugging Face PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset — Pushes native text-to-image generation to 100-megapixel ultra-high resolution with a new large dataset.
Paper · Hugging Face Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding — Hybrid draft-and-retrieve tree construction raises speculative-decoding acceptance rates.
Paper · Hugging Face Delta Attention Residuals — Replaces additive residuals with learned softmax attention over previous layers for selective cross-layer routing.
Paper · Hugging Face ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop — Benchmark for embodied agents that must act to acquire observations and reason about how observations change with action.
Paper · Hugging Face TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization — Out-of-core optimization scales 3D Gaussian Splatting past one billion primitives despite the memory bottleneck.
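
For readers unfamiliar with the acceptance rates mentioned in the speculative-decoding item above, here is a toy sketch of greedy verification over candidates from two sources. It is not the paper's tree-construction algorithm: `target_next_token` is a stub that stands in for a target model (it simply replays a fixed sentence), and pooling drafted and retrieved candidates into one flat list is a simplification of building a token tree.

```python
def target_next_token(context: tuple) -> str:
    """Stub for the target model's greedy next token (replays a fixed sentence)."""
    reference = ("the", "cat", "sat", "on", "the", "mat")
    return reference[len(context)] if len(context) < len(reference) else "<eos>"


def accepted_length(context: tuple, candidate: tuple) -> int:
    """Count leading candidate tokens that the target would also emit."""
    accepted = 0
    for token in candidate:
        if target_next_token(context + candidate[:accepted]) != token:
            break
        accepted += 1
    return accepted


def best_candidate(context: tuple, drafted: list, retrieved: list) -> tuple:
    """Pool both candidate sources and keep the one with the longest accepted prefix."""
    pool = list(drafted) + list(retrieved)
    return max(pool, key=lambda cand: accepted_length(context, cand))


context = ("the", "cat")
drafted = [("sat", "in")]                                   # from a small draft model
retrieved = [("sat", "on", "the", "mat"), ("ran", "away")]  # from a retrieval store
best = best_candidate(context, drafted, retrieved)
```

In this contrived case the retrieved continuation is accepted for four tokens while the drafted one is accepted for only one, which is the kind of gain a hybrid candidate pool is meant to capture.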