Microsoft launches its second-generation AI inference chip, Maia 200

Signaling that the future of AI may hinge not just on how many tokens a model generates, but on how efficiently it generates them, Microsoft has announced Maia 200, which it describes as a breakthrough inference accelerator and an inference powerhouse.

The AI silicon is designed for heterogeneous AI infrastructure in multiple environments, and was specifically developed for inferencing on large reasoning models. Microsoft claims it is the most performant first-party silicon from any hyperscaler today, and the most efficient inference system it has ever deployed.

Microsoft’s approach differs from that of other hyperscalers, said Matt Kimball, VP and principal analyst, Moor Insights & Strategy. “Where the other cloud service providers (CSPs) delivered platforms that focused on training and inference with a bias for their own bespoke stacks, Microsoft sees inference as the strategic landing zone and built a platform optimized for that agentic AI-powered environment,” he noted.

How Maia stacks up by the numbers

Maia 200 delivers 3X better 4-bit floating-point (FP4) performance than third-generation Amazon Trainium, Microsoft claims, and 8-bit floating point (FP8) performance above that of Google’s seventh generation TPU.

By the numbers, this means that Maia features:

10,145 FP4 teraflops at peak, versus 2,517 with AWS Trainium3

5,072 FP8 teraflops at peak, versus 2,517 with Trainium3 and 4,614 with Google's TPU v7

HBM bandwidth of 7 terabytes per second (TB/s), versus 4.9TB/s with Trainium3 and 7.4TB/s with the TPU v7

HBM capacity of 216GB, versus 144GB with Trainium3 and 192GB with the TPU v7
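
For a sense of what those peaks imply relative to one another, the back-of-the-envelope arithmetic below (a minimal Python sketch using only the vendor figures quoted above, which have not been independently verified here) works out the FP8 and memory gaps:

```python
# Peak figures as quoted above (teraflops for compute, TB/s and GB for memory).
maia_200 = {"fp4_tflops": 10_145, "fp8_tflops": 5_072, "hbm_tbps": 7.0, "hbm_gb": 216}
trainium3 = {"fp8_tflops": 2_517, "hbm_tbps": 4.9, "hbm_gb": 144}
tpu_v7 = {"fp8_tflops": 4_614, "hbm_tbps": 7.4, "hbm_gb": 192}

# FP8 throughput advantage over each competitor, per the quoted peaks.
print(f"FP8 vs Trainium3: {maia_200['fp8_tflops'] / trainium3['fp8_tflops']:.1f}x")  # ~2.0x
print(f"FP8 vs TPU v7:    {maia_200['fp8_tflops'] / tpu_v7['fp8_tflops']:.1f}x")     # ~1.1x

# Memory headroom: extra HBM capacity relative to each competitor.
print(f"HBM capacity vs Trainium3: +{maia_200['hbm_gb'] - trainium3['hbm_gb']} GB")  # +72 GB
print(f"HBM capacity vs TPU v7:    +{maia_200['hbm_gb'] - tpu_v7['hbm_gb']} GB")     # +24 GB
```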

Further, the tech giant says, Maia provides 30% better performance per dollar than the “latest generation hardware in our fleet today.” A “tremendous amount” of HBM allows models to run as close to compute as possible.

“In practical terms, Maia 200 can effortlessly run today’s largest models, with plenty of headroom for even bigger models in the future,” Microsoft says.

Maia feeds data to models differently, too, through what Microsoft describes as a redesigned memory subsystem featuring a specialized direct memory access (DMA) engine, on-die static random-access memory (SRAM), and a specialized network-on-chip (NoC) fabric. Together, these allow high-bandwidth data movement that increases token throughput.
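
Microsoft has not published the programming model behind this subsystem, but the behavior it describes, keeping compute busy while data is staged from HBM into on-die SRAM, is essentially classic double buffering. The Python sketch below illustrates that general pattern only; dma_copy and compute_tile are hypothetical stand-ins, not part of any published Maia API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins, NOT part of any published Maia API:
# dma_copy simulates moving a tile of weights/activations from HBM into SRAM,
# compute_tile simulates running the math on the tile that is already resident.
def dma_copy(tile):
    return list(tile)            # pretend "copy into on-die SRAM"

def compute_tile(sram_tile):
    return sum(sram_tile)        # pretend "compute on the resident tile"

def run_pipelined(tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma_engine:
        pending = dma_engine.submit(dma_copy, tiles[0])      # prefetch the first tile
        for i in range(len(tiles)):
            resident = pending.result()                      # wait until the tile is in "SRAM"
            if i + 1 < len(tiles):
                # Kick off the next transfer before computing, so DMA and compute overlap.
                pending = dma_engine.submit(dma_copy, tiles[i + 1])
            results.append(compute_tile(resident))
    return results

print(run_pipelined([[1, 2], [3, 4], [5, 6]]))   # [3, 7, 11]
```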

Designed for heterogeneity, multi-modal AI

Microsoft says it specifically designed Maia 200 with modern LLMs in mind; forward-thinking customers, it says, are looking not just for text prompts, but for access to multimodal capabilities (sound, images, video) that support deeper reasoning, multi-step agents, and, eventually, autonomous AI tasks.

As part of its heterogeneous AI infrastructure, Microsoft says that Maia 200 will serve multiple models, including OpenAI’s latest GPT-5.2 family. It integrates seamlessly with Microsoft Azure, and Microsoft Foundry and Microsoft 365 Copilot will also benefit from the chip. The company’s superintelligence team also plans to use Maia 200 for reinforcement learning (RL) and synthetic data generation to improve in-house models.

From a specification perspective, Maia 200 exceeds Amazon’s Trainium and Inferentia and Google’s TPU v4i and v5i, noted Scott Bickley, advisory fellow at Info-Tech Research Group. It is produced on a 3nm node, versus the 7nm or 5nm nodes for the Amazon and Google chips, and it also displays superior performance in compute, interconnect, and memory capabilities, he said. 

However, he noted, “while these numbers are impressive, customers should verify actual performance within the Azure stack prior to scaling out workloads away from Nvidia, as an example.” They should also ensure that part of the 30% savings Microsoft realizes is passed through to customers via their Azure subscription charges, he added.

“Use cases ideal for the Maia 200 would entail high-throughput workloads coupled with memory for large models,” said Bickley.

Microsoft improving on previous design challenges

Previous versions of Maia were “plagued by design and development challenges” that were “mostly self-inflicted,” Bickley noted. This slowed Microsoft’s development in the space in 2024 and 2025, even as its competitors were speeding up theirs.

“With access to OpenAI’s IP, they appear to be closing the gap,” he said. And using TSMC’s 3nm process, HBM and on-chip SRAM, and optimization for inference performance, Microsoft “may have evolved this chip in a way that will materially reduce their own infrastructure costs.”

Maia’s software-hardware architecture makes sense for inferencing, added Moor’s Kimball. “Rich SRAM and HBM allow that bandwidth, with steady-state inferencing, to fly,” he said. Further, the chip features industry-standard interconnects to “deliver performance at the component, system, rack, and even datacenter level.”

Microsoft’s open software stack is “designed specifically to make standing up inference on Maia frictionless,” Kimball noted, emphasizing, “this is not Microsoft trying to replace Nvidia or AMD. It’s about complementing.”

Arguably, Microsoft knows the enterprise IT organization better than any other cloud, as its software and tools have dominated this market for decades, Kimball pointed out. Its Maia team has taken advantage of this knowledge to deliver an inference service that “seems to be simply embedded in the Azure platform fabric,” he said.

Developers and other early adopters can sign up for the preview Maia 200 software development kit (SDK), which provides tools for building and optimizing models for Maia 200, such as PyTorch integration, a Triton compiler, and an optimized kernel library, as well as access to Maia’s low-level programming language.
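
Microsoft has not published the SDK’s interfaces, but the mention of a Triton compiler points to the open-source Triton programming model, in which kernels are written as Python functions and compiled for whatever backend the toolchain targets. The sketch below is a generic Triton vector-add kernel of the sort such a compiler consumes; nothing in it is Maia-specific, and how a Maia device or backend would actually be selected is not publicly documented.

```python
import torch
import triton
import triton.language as tl

# A standard Triton kernel: elementwise add over a 1-D tensor, processed in blocks.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard against the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program instance per 1,024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```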

Maia 200 is currently deployed in Microsoft’s US Central data center region near Des Moines, Iowa. It will next arrive in the US West 3 data center region near Phoenix, Arizona, followed by other regions; timing and locations have not yet been revealed.

This article originally appeared on Network World.