Best of the Week in AI: May 9–15, 2026 — The Week the Lab Hierarchy Flipped

For about two years, “which AI lab is winning” was a question with a boring answer. OpenAI. Always OpenAI. Bigger valuation, more revenue, more enterprise logos, more mindshare. The other labs were running the same race, just a few laps behind. That mental model survived GPT-4, GPT-5, the Anthropic-AWS deal, even the three-flagship month in April. It did not survive this week.

Between Monday and Friday, five separate AI scoreboards moved, and they moved toward four different labs. Anthropic reportedly priced itself above OpenAI. Anthropic also passed OpenAI in business adoption for the first time. OpenAI, meanwhile, won Brussels. Mira Murati’s lab shipped an architecture nobody else has. And Google leaked the model that’s going to dominate next week’s I/O keynote. There is no longer a coherent “ranking of AI labs.” There is a portfolio of leaderboards, and no single lab leads them all.

Story 1: Anthropic Reportedly Eyes a $950B Valuation — Bigger Than OpenAI

On May 12, the New York Times’ Mike Isaac reported that Anthropic is in early talks to raise between $30 billion and $50 billion in a new funding round, at a valuation as high as $950 billion. If the round closes anywhere near that mark, Anthropic will be worth more than OpenAI’s most recently reported $825 billion valuation — which would have been unthinkable a year ago.

For context, Anthropic’s Series F in September 2025 priced it at $183 billion. Its Series G in February 2026 took that to $380 billion post-money. Three months later, the number on the table has more than doubled again. Annualized revenue, per the same reporting cluster, has gone from roughly $9 billion at the end of 2025 to over $44 billion now. Whatever you think about AI valuations in general, this isn’t hype velocity. It’s revenue velocity feeding a valuation that’s only barely keeping up.
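To make "revenue velocity" concrete, here is a quick back-of-envelope calculation using only the figures quoted above. One pairing is an assumption: the article gives no revenue number for February 2026, so the Series G price is matched with the end-of-2025 revenue figure, which makes the "then" multiple approximate.

```python
# All figures are the ones quoted in this article, in billions of US dollars.
SERIES_G_VALUATION = 380.0    # February 2026, post-money
TARGET_VALUATION = 950.0      # top of the reported range
REVENUE_END_2025 = 9.0        # annualized, approximate
REVENUE_NOW = 44.0            # annualized, "over $44 billion"


def multiple(numerator: float, denominator: float) -> float:
    """Return numerator / denominator rounded to one decimal place."""
    return round(numerator / denominator, 1)


valuation_growth = multiple(TARGET_VALUATION, SERIES_G_VALUATION)  # since Series G
revenue_growth = multiple(REVENUE_NOW, REVENUE_END_2025)           # since end of 2025

# Assumption: pair the Series G price with end-of-2025 revenue, because no
# revenue figure is given for February 2026 itself.
price_to_revenue_then = multiple(SERIES_G_VALUATION, REVENUE_END_2025)
price_to_revenue_now = multiple(TARGET_VALUATION, REVENUE_NOW)

print(f"Valuation growth: {valuation_growth}x")
print(f"Revenue growth: {revenue_growth}x")
print(f"Price-to-revenue: {price_to_revenue_then}x then, {price_to_revenue_now}x now")
```

On these inputs the valuation rises 2.5x while revenue rises about 4.9x, so the implied price-to-revenue multiple falls from roughly 42x to roughly 22x. That is the sense in which the valuation is trailing revenue rather than leading it.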

The biggest fundraise in AI history is happening at a company that, eighteen months ago, most enterprise buyers couldn’t spell. The bottleneck on Anthropic’s growth is no longer demand. It’s compute, and that’s what the $30 billion to $50 billion is for.

Story 2: Anthropic Just Overtook OpenAI on Business Adoption

Twenty-four hours after the funding news, the May 2026 Ramp AI Index landed and made the valuation story make sense. For the first time ever, more businesses are paying Anthropic than OpenAI. The numbers, from Ramp’s anonymized spend data across tens of thousands of US companies:

Provider	US business adoption (April 2026)	Month-over-month change
Anthropic	34.4%	+3.8 pts
OpenAI	32.3%	−2.9 pts

Two points of separation is statistically thin. The trend underneath it is not. Over the past twelve months, Anthropic has roughly quadrupled its business adoption while OpenAI’s grew by just 0.3%. By February, Ramp had Anthropic winning ~70% of head-to-head matchups against OpenAI among first-time enterprise AI buyers. The proximate cause everyone agrees on: Claude Code is the fastest-growing product in Anthropic’s history, and developer tooling is now a top-three line item in enterprise AI budgets.
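The month-over-month column in the table also implies where each provider stood a month earlier. The sketch below backs those numbers out. The March shares are derived here from the table, not quoted from Ramp, and the calculation assumes both columns are in percentage points.

```python
# (share in April 2026, month-over-month change), both in percentage points,
# copied from the Ramp table above.
APRIL = {"Anthropic": (34.4, +3.8), "OpenAI": (32.3, -2.9)}


def implied_prior_share(share: float, change: float) -> float:
    """Back out the previous month's share from this month's share and change."""
    return round(share - change, 1)


# Derived, not reported: the March shares implied by the table.
march = {name: implied_prior_share(*row) for name, row in APRIL.items()}

gap_april = round(APRIL["Anthropic"][0] - APRIL["OpenAI"][0], 1)
gap_march = round(march["Anthropic"] - march["OpenAI"], 1)
swing = round(gap_april - gap_march, 1)

print(f"Implied March shares: {march}")
print(f"Anthropic minus OpenAI: {gap_march} pts in March, {gap_april} pts in April")
print(f"One-month swing: {swing} pts")
```

Read this way, the 2.1-point lead is the visible end of a 6.7-point swing in a single month: the table implies OpenAI was ahead by 4.6 points in March.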

Ramp’s own analyst added the obligatory cold-water caveat: Anthropic monetizes per-token, which means it’s structurally incentivized to push you toward more expensive models even when a cheaper one would do. Uber’s CTO publicly said the company burned its entire 2026 AI budget in four months, largely on Claude Code and Cursor. That’s great for Anthropic’s revenue line. It is also why the enterprise lead is fragile — the second a CFO notices the spend curve, the routing conversation starts.

Story 3: OpenAI Wins Brussels with GPT-5.5-Cyber. Anthropic Doesn’t.

On Monday May 11, OpenAI announced it would grant the European Union access to GPT-5.5-Cyber, a hardened variant of GPT-5.5 designed for high-impact cybersecurity research. The deal — framed as the “OpenAI EU Cyber Action Plan” — scopes pre-release evaluation access for European businesses, governments, national cyber authorities, and EU institutions including the EU AI Office.

This is a direct echo of the CAISI deal from last week, where the US government locked in pre-launch access to Google, Microsoft, and xAI’s frontier models. The pattern is now obvious: in 2026, the most valuable thing a frontier lab can ship isn’t a model — it’s a regulator-grade access agreement.

Notice who is missing from both stories. Anthropic’s own cyber-capable model, Mythos Preview, shipped roughly a month earlier in April under Project Glasswing — the same closed-distribution program we wrote about two weeks ago. As of this week, the European Commission has had “four or five” meetings with Anthropic and is, in the EU’s own words, “not yet at the same stage as the solution we have on the table from OpenAI.” Anthropic is still saying no.

There’s a real strategic logic here. Anthropic believes Mythos is too capable to release on the same terms as a normal model — the same Glasswing testing reportedly turned up “thousands of zero-day vulnerabilities” across every major operating system and browser within weeks. You can call that safety-maximalism or you can call it product moat. Both are true. What you cannot call it is a winning regulatory strategy. In the channel-wars era, the lab that says yes to governments wins the public-sector lane — and OpenAI just said yes, twice, in eight days.

Story 4: Mira Murati Just Shipped an Architecture Nobody Else Has

On May 11, Thinking Machines Lab, the AI company founded by former OpenAI CTO Mira Murati and reportedly valued at roughly $12 billion despite having never shipped a product, finally shipped one. It calls the new model class interaction models, and they are not chatbots.

The preview release, TML-Interaction-Small, is a full-duplex multimodal model: it listens, sees, and speaks simultaneously, with no turn-taking, no voice-activity-detection harness, and no stitched pipeline of separate ASR/LLM/TTS components. The headline number: a median response latency of 0.40 seconds, which is roughly the gap a human leaves between hearing the end of your sentence and starting their own. Under the hood it’s a 276B-parameter Mixture-of-Experts model with about 12B active parameters per token.

The architectural trick that makes this work is the split. The interaction model stays live with the user, generating 200ms “micro-turns” of speech, vision, and tool calls. A second “background” model runs reasoning and longer tool use asynchronously, sharing full conversation context. The result is a system that can interrupt itself, react to what it sees on your screen mid-sentence, and call a tool without ever stopping to “think.”
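The split described above can be illustrated with a toy event loop. This is not Thinking Machines' implementation, which has not been published in this form. It is a minimal sketch of the general pattern, assuming one live loop that never blocks and one slow worker that shares its context. The names interaction_model and background_model, and the delay values, are hypothetical.

```python
import asyncio

MICRO_TURN_SECONDS = 0.2  # the 200 ms micro-turn length quoted in the article


async def background_model(context: list, results: asyncio.Queue, delay: float) -> None:
    """Hypothetical slow path: reason over the shared context, then post a result."""
    await asyncio.sleep(delay)  # stand-in for long reasoning or tool use
    await results.put(f"background: reasoned over {len(context)} inputs")


async def interaction_model(
    user_chunks: list,
    turn_seconds: float = MICRO_TURN_SECONDS,
    background_delay: float = 0.5,
) -> list:
    """Hypothetical live path: emit one micro-turn per input chunk, never block
    on the background model, and fold in its result once it is ready."""
    context = []                  # shared by both models
    results = asyncio.Queue()
    transcript = []

    task = asyncio.create_task(background_model(context, results, background_delay))
    for chunk in user_chunks:
        context.append(chunk)
        transcript.append(f"live: {chunk}")  # respond immediately
        while not results.empty():           # splice in finished background work
            transcript.append(results.get_nowait())
        await asyncio.sleep(turn_seconds)    # wait for the next micro-turn

    await task                               # flush anything still in flight
    while not results.empty():
        transcript.append(results.get_nowait())
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(interaction_model(["hello", "can you", "check my", "screen"])):
        print(line)
```

The design point is that the live loop only ever polls the queue. It keeps responding on every micro-turn whether or not the slower model has finished, which is what lets a system like this answer without stopping to "think."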

If this works as advertised at scale, every other lab’s realtime voice product — OpenAI Realtime, Gemini Live, Grok Voice — has a strictly worse interaction loop. That’s a big “if.” Right now it’s a limited research preview. But this is the first time since Anthropic’s constitutional-AI papers that a non-Big-Three lab has proposed an actually-different architecture, not just a bigger version of the same one. Worth watching.

Story 5: Google Quietly Leaked “Gemini Omni” Eight Days Before I/O

On May 11, the same day Murati was demoing 0.40-second latency, an X user spotted a UI string in Google’s Gemini video tab that read: “Start with an idea or try a template. Powered by Omni.” A separate string surfaced shortly after: “Create with Gemini Omni: meet our new video model, remix your videos, edit directly in chat, try templates, and more.”

Google I/O 2026 runs May 19–20. The leak is almost certainly the I/O reveal — and the framing matters. Omni isn’t being pitched as a video-generation model. It’s being pitched as a video-generation-and-editing model, unified with image and text in a single multimodal stack. If that holds up, it’s the first top-tier model that does generation and editing in one inference path, rather than gluing a generator to a separate editor with a router in between. Tiered Flash and Pro variants are expected, consistent with every other Gemini release.

The reason this matters for the week’s narrative: while Anthropic was winning enterprise and OpenAI was winning Brussels, Google was quietly setting up to win the consumer multimodal lane. Multimodal is the one category where Gemini 3.1 Pro already led the April flagship pack; Omni looks designed to lengthen that lead before any other lab can respond. None of these labs are losing. They’re just winning different things.

The Bonus Story: Berkeley’s Benchmark Bomb Is Still Going Off

Hovering over all of this is a piece of research from April that finally hit critical mass in industry conversation this week. UC Berkeley’s Center for Responsible, Decentralized Intelligence published a paper showing that every single one of eight leading AI agent benchmarks — SWE-bench Verified, SWE-bench Pro, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — can be exploited to near-perfect scores without solving the underlying task.

The SWE-bench exploit is especially brutal. A 10-line conftest.py file with a pytest hook that forces every test to report as passing “resolves” 500/500 SWE-bench Verified instances and 731/731 SWE-bench Pro instances. Zero issues actually fixed. Their automated scanner, BenchJack, scored 100% on five of the eight benchmarks — with zero LLM calls. They also surfaced a real-world case: a model called IQuest-Coder-V1 that claimed 81.4% on SWE-bench, with researchers later finding that 24.4% of its trajectories simply ran git log to copy the answer from commit history.

Every benchmark-driven leaderboard ranking you’ve seen this year — including the ones we ourselves cited in the three-flagship breakdown — deserves a discount factor as a result. The labs whose models actually score on SWE-bench by writing real code are still ahead of the ones that game it, but you can no longer assume a public benchmark number reflects real capability. You have to verify on your own work.
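If you run your own evaluations, the two failure modes described above (a test-forcing conftest.py and trajectories that read commit history) suggest a cheap pre-scoring check. The sketch below is a heuristic of our own, not the Berkeley team's BenchJack scanner. The function names and the file list are assumptions, and a clean result does not prove a run is honest.

```python
import re
from pathlib import PurePosixPath

# Files pytest loads or reads automatically. A "fix" that touches one of them
# deserves a manual look before its test results are trusted.
PYTEST_CONTROL_FILES = {"conftest.py", "pytest.ini", "tox.ini", "setup.cfg"}

# Commands that read repository history, where a reference fix may be visible.
HISTORY_COMMAND = re.compile(r"\bgit\s+(log|show)\b")


def suspicious_files(patched_paths):
    """Return patched paths whose file name is a pytest control file."""
    return sorted(
        p for p in patched_paths if PurePosixPath(p).name in PYTEST_CONTROL_FILES
    )


def suspicious_commands(commands):
    """Return shell commands from an agent trajectory that inspect git history."""
    return [c for c in commands if HISTORY_COMMAND.search(c)]


def needs_manual_review(patched_paths, commands) -> bool:
    """Flag a submission if either heuristic fires."""
    return bool(suspicious_files(patched_paths) or suspicious_commands(commands))


if __name__ == "__main__":
    paths = ["src/app.py", "tests/conftest.py"]
    cmds = ["pytest -q", "git log --oneline -5"]
    print(suspicious_files(paths))
    print(suspicious_commands(cmds))
    print(needs_manual_review(paths, cmds))
```

A flagged submission is not necessarily cheating, since legitimate fixes sometimes edit test configuration. The point is to route those cases to a human instead of counting them as passes automatically.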

The Thesis: “Best Lab” Just Broke the Same Way “Best Model” Did

Step back from the five stories and the shape is clear. The labs are no longer racing each other along a single axis. They’re racing on different axes entirely — and this week, each one won a different race:

Scoreboard	This week’s leader	Evidence
Valuation	Anthropic	$950B target vs. OpenAI’s $825B (NYT, May 12)
US business adoption	Anthropic	34.4% vs. 32.3% (Ramp AI Index, May 13)
EU government access	OpenAI	GPT-5.5-Cyber to EU institutions; Anthropic withholds Mythos
Realtime / interaction architecture	Thinking Machines	TML-Interaction-Small, 0.40s full-duplex
Multimodal video	Google (pending I/O)	Gemini Omni UI leak, expected May 19–20
Pure benchmark scores	Unreliable	Berkeley shows every major agent benchmark is gameable

You cannot pick “the best lab” from this table any more than you could pick the best model from the April flagship table. The labs aren’t even competing for the same trophy anymore. Anthropic is optimizing for enterprise and capability. OpenAI is optimizing for distribution and regulators. Google is optimizing for consumer multimodal and price. Thinking Machines is optimizing for a brand-new interaction surface. And the underlying benchmark numbers we’d historically use to compare them are, per Berkeley, increasingly unreliable.

What This Means for Anyone Actually Using AI

If the lab race has fragmented this far in one week, the practical implications are sharper than ever:

Lab loyalty is a worse hedge than model loyalty. A year ago, picking “an OpenAI shop” was a defensible enterprise choice. This week alone it would have cost you the best coding model, the best EU compliance posture, the best realtime architecture, and the best video model, depending on which lane you cared about.
Benchmark-led procurement is dead. When the leaderboards themselves can be exploited with a 10-line file, the only credible signal is performance on your own representative tasks — ideally with multiple models running the same prompt so you can see where they agree and where they don’t.
Watch the channel, not the model. The most consequential moves of the last two weeks — Pentagon vendor lists, CAISI access, EU cyber deals, Wall Street joint ventures — were all distribution and access plays, not model launches. The 2026 lab race is being fought in channels.
Architectural diversity is back. Thinking Machines just demonstrated that there’s still real room to compete on how a model works, not just how big it is. If interaction models work, expect every realtime product on the market to be re-architected within twelve months.
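The "same prompt, multiple models" check from the second point above can be sketched in a few lines. This is an illustration only: the model callables are hypothetical stand-ins for whatever client code you already have, and exact-match comparison after light normalization is a deliberately crude assumption that works for short factual answers and little else.

```python
from collections import Counter
from typing import Callable, Dict


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(answer.lower().split()).rstrip(".")


def agreement_report(prompt: str, models: Dict[str, Callable[[str], str]]) -> dict:
    """Send one prompt to every model and report the consensus and the outliers."""
    if not models:
        raise ValueError("at least one model is required")
    answers = {name: normalize(ask(prompt)) for name, ask in models.items()}
    consensus, votes = Counter(answers.values()).most_common(1)[0]
    outliers = sorted(name for name, a in answers.items() if a != consensus)
    return {
        "consensus": consensus,
        "agreement": votes / len(answers),
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical stubs; replace each lambda with a real API call.
    stubs = {
        "model_a": lambda prompt: "Paris.",
        "model_b": lambda prompt: "paris",
        "model_c": lambda prompt: "Lyon",
    }
    print(agreement_report("What is the capital of France?", stubs))
```

For longer or open-ended answers you would need a softer comparison than string equality, but even this crude version surfaces the useful signal: which model is alone.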

How SNEOS Helps in a Week Like This

If you read all five stories and your reaction was “I have no idea which model to actually open tomorrow morning”, that’s the rational reaction. It’s also exactly the problem SNEOS exists to remove. One prompt, sent to ChatGPT, Claude, Gemini, Grok, DeepSeek, and Mistral at once. Side-by-side answers in the same view. Premium adds Trust Scoring and AI Consensus so you can see where the models converge and where one of them is alone — which, given the Berkeley benchmark revelations, is the only signal we’d still bet on this week.

You don’t have to predict which lab wins the next quarter. You can route prompts to whoever wins yours, today, on your actual work — and re-route them next week when the scoreboard flips again. Because it will.

The Bigger Picture

For two years, AI commentary has had the comforting structure of a horse race. One leader, one challenger, one open-source dark horse. That structure is gone. What replaced it is a portfolio of overlapping races, with different leaders, different scoring systems, and a meta-result in which nobody is “ahead” in any general sense. This is uncomfortable for analysts who need a clean narrative. It is good news for anyone who actually uses these tools, because the costly mistake is no longer backing the wrong lab. It is asking the wrong question.

The right question, again, isn’t “which lab wins.” It’s “am I set up to use more than one?” This week the answer to the second question got an extra zero of payoff.

Did one of this week’s stories change how you think about your AI stack? We’d love to hear which. Drop us a line.