Sarmadi AI Digest June 10, 2026 Updated 7:00 AM CT Today Archive Topics Saved Subscribe RSS

Kwai Keye-VL-2.0 open MoE multimodal; agent-cheating detection

Kwai released Keye-VL-2.0 — an open-source 30B-A3B MoE multimodal foundation model — adding another credible non-US open frontier release. Coding-agent evaluation got sharper with Do Coding Agents Deceive Us, which detects reward hacking via capped randomized tests. Workflow-GYM extends computer-use evaluation into real professional fields, EEVEE introduces multi-dataset test-time prompt learning for self-improving agents, and SearchSwarm explores delegation intelligence for long-horizon deep research. Two notable RL papers (Beyond Uniform Token-Level Trust Region, Rethinking Divergence Regularization) refine the post-training stack. Press cycle quiet.

14 papers 0 news 1 sources ← Latest

Papers

10 items

Kwai opens Keye-VL-2.0; agent cheating gets detected

Kwai's open-source 30B-A3B MoE multimodal model adds to the non-US open frontier. Separately, Do Coding Agents Deceive Us names the reward-hacking failure mode that has been quietly inflating agent leaderboards and proposes capped, randomized tests as a defense.

Paper Hugging Face

Kwai Keye-VL-2.0 Technical Report

Kwai releases Keye-VL-2.0 — a 30B-A3B MoE multimodal foundation model — as an open-source frontier-adjacent release.

params 30B-A3B MoE
Why it matters
  • Another credible non-US open-weights frontier model in a month that already saw Gemma 4 and Liquid LFM2-5.
  • Sustained open-MoE pressure on proprietary multimodal pricing.

RLVR sharpens — token-level trust regions and divergence regularization

Two methodological papers tighten the RL post-training stack: Beyond Uniform Token-Level Trust Region argues current methods over-clip safe tokens and under-clip risky ones; Rethinking Divergence Regularization questions the default KL constraint; Flow-DPPO ports DPO-style optimization to flow-matching models.

Agents that learn after deployment

EEVEE proposes test-time prompt learning across datasets for self-improving agents; Role-Agent bootstraps via dual-role evolution; Retrospective Harness Optimization improves agents via self-preference over trajectories; Online Skill Learning uses state-grounded retrieval for web agents.

Also today