Musk loses the OpenAI suit; Anthropic buys Stainless

A nine-person jury took two hours to deliver a unanimous verdict against Elon Musk in Musk v. Altman, the judge affirmed it immediately, and Musk plans to appeal. The founding-myth question is legally closed; the governance question is not. Anthropic acquired Stainless, the dev-tools startup whose API SDK generators are already shipping inside OpenAI, Google, and Cloudflare, a quiet acquisition that gives Anthropic ownership of a piece of its competitors' infrastructure. SandboxAQ moved its drug-discovery models inside Claude as a first-class capability, another Anthropic vertical play this week after Legal and Small Business. Underneath, the research wave consolidated around vertical benchmarks (healthcare workflows, finance domain knowledge, tool-using agents) and a sharp reality check: a new web-app benchmark finds generated applications fail functional requirements in over 70% of cases, and bug-bounty programs are being overwhelmed by AI-generated slop submissions.

11 papers 13 news 8 sources

News

11 items

Musk loses the OpenAI suit

A unanimous nine-juror verdict against Musk ended the trial after two hours of deliberation. The judge immediately affirmed the advisory verdict. Musk plans to appeal, but the courtroom phase of the OpenAI founding dispute is over and the company exits the case with its restructuring intact.

News TechCrunch AI

Elon Musk has lost his lawsuit against Sam Altman and OpenAI

Nine California jurors unanimously rejected Musk's claim that he was mistreated by his OpenAI co-founders; the suit is over absent a successful appeal.

Why it matters

Removes the largest near-term legal threat to OpenAI's nonprofit-to-for-profit transition.
Closes the most cited governance overhang in the AI capital markets right now.
Settles the founding narrative in OpenAI's favor — Altman's public credibility comes out improved on paper.

News The Verge AI

Elon Musk loses his case against Sam Altman

The Verge's report on the verdict — two hours of deliberation, unanimous against Musk.

policy market

News Wired AI

Elon Musk Loses Landmark Lawsuit Against OpenAI

Wired's verdict coverage notes the judge quickly affirmed the jury's two-hour decision.

policy market

News Ars Technica AI

Elon Musk took too long to sue OpenAI, jury unanimously agrees

Ars Technica reports the jury's rationale leaned heavily on Musk having waited too long to file; appeal incoming.

policy regulation

News MIT Technology Review

Here's why Elon Musk lost his suit against OpenAI

MIT Tech Review breaks down the legal reasoning behind the unanimous advisory verdict.

policy regulation

News The Verge AI

Musk v. Altman proved that AI is led by the wrong people

The Verge's harshest take — verdict aside, the trial exposed both principals as poor stewards of frontier AI.

policy market

Anthropic's quiet expansion week

Anthropic acquired Stainless — the dev-tool company whose SDK generation already runs inside OpenAI, Google, and Cloudflare — and added SandboxAQ's drug-discovery models inside Claude as a first-class capability. Coming after Claude for Small Business and Claude for Legal, the pattern is now legible: own the integration layer, then the vertical.

News TechCrunch AI

Anthropic has acquired the dev tools startup used by OpenAI, Google, and Cloudflare

Anthropic bought Stainless, a New York-based SDK-generation startup whose tools are already deployed inside OpenAI, Google, and Cloudflare.

Why it matters

Gives Anthropic ownership of a piece of infrastructure embedded in its competitors' developer stacks.
Suggests Anthropic's near-term moat building is in tooling and distribution, not just model quality.
Awkward dependency moment for OpenAI and Google — both will likely accelerate in-house replacements.

market products tools

News TechCrunch AI

SandboxAQ brings its drug discovery models to Claude — no PhD in computing required

SandboxAQ's drug-discovery models are now first-class inside Claude, lowering the technical bar for biopharma teams.

Why it matters

Anthropic adds a regulated vertical (biopharma) to Legal and Small Business in the same week.
Demonstrates a working pattern for partner-model integration inside Claude — a template other domain vendors will pursue.

products market tools

Agent productivity reality check

A new multi-agent test-driven development (TDD) benchmark for full-stack web apps finds that generated applications fail functional requirements in over 70% of cases when actually deployed and exercised. AgentKernelArena exposes the same gap in GPU-kernel work: the agent state of the art is measured on single calls, not full workflows. Bug-bounty programs are buckling under AI-generated slop submissions. All three sharpen yesterday's commencement-boos thesis: the operator-floor question is real.

News Ars Technica AI

Bug bounty businesses bombarded with AI slop

Bug-bounty programs report a flood of AI-generated low-quality submissions straining triage capacity at major vendors.

safety tools policy

Agent memory and leaderboards consolidate

A training-free plug-and-play memory module for LLMs (NGM), IBM's new Open Agent Leaderboard, and Odyssey's Agora-1 multi-agent world model all point at the same maturing layer: the agent stack now has shared components and shared benchmarks that smaller teams can build against without rolling their own.

News Hugging Face

The Open Agent Leaderboard

IBM Research launches a public Open Agent Leaderboard on Hugging Face for cross-vendor agent comparison.

agents evaluation benchmarks

News Hacker News

Agora-1: The Multi-Agent World Model

Odyssey introduces Agora-1, a multi-agent world model focused on coordinated simulation across many actors (113 HN points).

world-models agents

Papers

6 items

Vertical AI benchmarks land

Three serious vertical benchmarks dropped together: CHI-Bench for end-to-end healthcare workflows, FINESSE-Bench for financial domain knowledge, and TOBench for tool-using agents on professional tasks. The shared move is to evaluate real workflows with real policy constraints, not isolated capabilities.

Paper Hugging Face

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench tests agents on policy-dense, multi-role, multi-turn healthcare workflows — the conditions where current evaluation gives the most misleading scores.

Why it matters

Bridges the gap between agent benchmarks and what a healthcare buyer actually needs to validate.
Should become a default eval for any vendor pitching agentic clinical-ops automation.
Pairs with the Ontario AI-notetaker audit this week — research and field both pointing the same direction.

agents evaluation benchmarks products

Paper Hugging Face

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Hierarchical finance benchmark covering domain knowledge and technical analysis — pushes past the QA-style scope of FinQA and TAT-QA.

Why it matters

Useful procurement-grade benchmark for any model pitched into finance.
Targets reasoning across tiers (knowledge, analysis, judgment), not just retrieval.

evaluation benchmarks reasoning

Paper Hugging Face

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

TOBench evaluates tool-using agents on multimodal inputs, coordinated tool calls, and intermediate-artifact inspection — closer to professional workflow shape.

agents tool-use benchmarks multimodal

Agent productivity reality check

Paper Hugging Face

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

When generated web apps are deployed and exercised through simulated browser interactions, they fail functional requirements in over 70% of cases.

functional-requirement failure rate >70%

Why it matters

Quantifies the 'looks-runnable, isn't-shippable' gap that operators have been describing anecdotally.
Suggests multi-agent TDD as the missing bridge between code generation and deployable software.
Direct counter to vibe-coding triumphalism — useful for SMB-buyer due diligence.

agents code evaluation benchmarks

Paper Hugging Face

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

GPU-kernel agent benchmark that measures full workflows with profiler and compiler tool use. Current state-of-the-art evaluations score single LLM calls, which hides generalization gaps.

agents evaluation benchmarks compute

Agent memory and leaderboards consolidate

Paper Hugging Face

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

Drop-in memory module for LLMs that requires no learned embeddings — agents get explicit lookup without retraining the base model.

agents memory tools

Also today

News · Hacker News · AI eats the world (Spring 26) - Annual deck on AI's penetration across industries; 245 HN points and a useful reference for charts on adoption and capex.
News · MIT Technology Review · Inside Anduril and Meta's quest to make smart glasses for warfare - Anduril shares details on the Meta-powered AR military headset, the defense-tech version of the consumer smart-glasses race.
News · TechCrunch AI · Amazon's new Alexa+ powered feature can generate podcast episodes - Alexa+ now generates on-demand AI podcasts on essentially any topic, widening the agentic surface across Amazon's stack.
News · The Verge AI · Gemini is in danger of going full Copilot - The Verge on the Gemini sparkle icon appearing in too many Google surfaces, a UX-creep critique Google has not yet answered.
News · MIT Technology Review · What to expect from Google this week - MIT Tech Review previews Google I/O 2026: agentic Gemini, Googlebooks rollout, more vibe-coded surfaces.
News · Ars Technica AI · Legal fail: Don't use AI to sue Facebook users for calling you a bad date - Fake citations sank a personal lawsuit, another data point on AI hallucinations bleeding into court filings.
News · Import AI · Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment - Jack Clark's newsletter on AI-driven cyberops, Muon optimizer pathologies, and positive-alignment framing.
News · Hugging Face · Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation - NVIDIA and Hugging Face guide to adapter-based fine-tuning of Cosmos Predict 2.5 on robot video, a practical recipe for robotics teams.
News · Hugging Face · PaddleOCR 3.5: Running OCR and Document Parsing Tasks with a Transformers Backend - PaddleOCR 3.5 supports a Transformers backend, making it a credible open option for document AI pipelines.
Paper · Hugging Face · StableVLA: Towards Robust Vision-Language-Action Models without Extra Data - Robustness recipe for vision-language-action models that doesn't require additional data; practical for sim-to-real teams.
Paper · Hugging Face · VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation - Instance-level video understanding via native agentic tool invocation; pushes video VLMs past whole-clip captioning.
Paper · Hugging Face · E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring - Post-merge quantization that anchors on merged weights to recover accuracy lost when quantizing model merges.
Paper · Hugging Face · A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation - Automatic generator for formally verifiable abstract reasoning benchmarks; useful for keeping evals fresh against memorization.
Paper · Hugging Face · AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents - Self-evolving visual skill memory for VLM agents that doesn't require a teacher; practical for sustained-use deployments.