Microsoft AI’s first reasoning model, MAI-Thinking-1, debuted at the company’s Build 2026 developer conference in San Francisco on 2 June, joining a field where most rival labs have already shipped multiple generations of reasoning systems. The 35-billion-parameter model is designed for multi-step agentic tasks and, on the SWE Bench Pro coding benchmark, scored similarly to Anthropic’s Claude Opus 4.6.

Microsoft AI Reasoning Model Joins a Crowded Field

MAI-Thinking-1 was not the only model Microsoft unveiled. According to Microsoft AI’s Build 2026 keynote transcript, the company launched six additional MAI models at the same event, spanning image generation, transcription, and voice. Among them, MAI-Image-2.5 and its Flash variant ranked second on image-editing leaderboards, surpassing Nano Banana 2, while Microsoft claims state-of-the-art accuracy across 43 languages for its transcription model, ahead of equivalent offerings from Google and OpenAI.

One MAI-tuned model was described in the keynote as comparable to GPT-5.4 on public and private benchmarks while being up to 10x more efficient. Digital Applied’s analysis of the MAI family also confirms that MAI-Code-1-Flash, a compact coding model, began rolling out across GitHub Copilot plans on 2 June.

The strategic framing matters as much as the models themselves. Windows Forum’s Build 2026 coverage characterises the MAI launch as a deliberate shift away from OpenAI exclusivity, with Azure, Foundry, GitHub, and Copilot positioned as the platform layer regardless of which underlying model (OpenAI, Anthropic, Microsoft AI, or another lab) powers any given task. Microsoft AI also emphasised that MAI-Thinking-1 was trained exclusively on commercially safe data, a pitch aimed squarely at enterprise clients wary of AI copyright litigation.

Anthropic Keeps Pushing on Safety and Honesty

The most recent Anthropic release before the Build announcements was Claude Opus 4.8, which replaced Opus 4.7 on 28 May at the same price point and at one-third the cost of the earlier version. Beyond the price, the model carries substantive changes to how it reasons. Anthropic’s release notes confirm that Opus 4.8 uses ‘adaptive thinking’ as its only supported thinking mode, triggering reasoning only when it judges the turn requires it, with the effort parameter defaulting to ‘high’ across the Claude API and Claude Code.

On benchmark performance, Anthropic’s Claude Opus product page states that Opus 4.8 achieved the highest score yet recorded on Anthropic’s Legal Agent Benchmark and became the first model to break 10% overall on the all-pass standard for that benchmark. According to Anthropic’s Opus 4.8 announcement, the model is around four times less likely than Opus 4.7 to allow flaws in code it has written to pass unremarked.

The safety trajectory is consistent with where Anthropic has been heading. The company reported that Opus 4.7 carried a 92% honesty rate and that Opus 4.8 shows substantially lower rates of misalignment than its predecessor. Opus 4.8 scored higher than Opus 4.7 on two coding benchmarks, though it did not fully surpass OpenAI’s GPT-5.5.

Mythos Moves Beyond Preview

When Anthropic unveiled Claude Mythos in April as a model it deemed too capable to release publicly, it cited the system’s ability to carry out computer security tasks as a core concern. The company launched Project Glasswing in response, a collaborative effort with rival labs and security organisations. Initial reporting named Palo Alto Networks as a security partner; Anthropic’s Project Glasswing page identifies CrowdStrike as a founding member, alongside AWS, Microsoft MSRC, and Google Cloud’s Vertex AI team.

Mythos has since moved beyond its Preview designation. Anthropic’s Claude Mythos page describes a subsequent release called Claude Mythos 5, with Claude Fable 5 described as the same underlying model but fitted with robust safeguards for cybersecurity and biology domains for broader release. Anthropic’s Frontier Red Team has published a detailed writeup on how the system discovers, reproduces, and patches real-world vulnerabilities.

Other Notable Releases Since April

Google launched Gemini 3.5 Flash at Google I/O on 19 May, with the model now running as the default in both AI Mode in Search and the Gemini app. Gemini 3.5 Pro is expected to follow in June. The system card for 3.5 Flash contains no mention of hallucination rate or sycophancy, a notable omission given Search’s reach.

GPT-5.5 Instant, released on 5 May, replaced GPT-5.3 Instant as the default model in ChatGPT. OpenAI said the model produced 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts in medicine, law, and finance. Before that, ZDNET’s David Gewirtz gave GPT-5.5 an Expert Score of 93/100, describing it as better and faster than GPT-5.4 across agentic coding, scientific research, and factual accuracy.

GPT-5.4, released on 5 March, was framed by OpenAI as built for professional work. According to the company’s own internal testing, the model matches or outperforms human professionals 83% of the time, though that figure awaits independent third-party verification. Nvidia’s Nemotron 3 Nano Omni, released on 28 April, takes a different architectural approach, letting agents perceive and reason across visual, audio, and textual inputs within a single system rather than bouncing between separate models for each modality.

With MAI-Thinking-1 now in the field, the next test for Microsoft’s reasoning model is whether enterprise benchmarks outside the lab confirm the SWE Bench Pro parity it claims with Opus 4.6, and whether its commercially clean training data becomes a genuine differentiator as AI copyright cases work their way through the courts.
