Microsoft AI released MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, three models spanning speech, voice, and image generation that are available on Microsoft Foundry and signal a serious in-house AI build-out.
Microsoft AI's MAI Superintelligence team, led by CEO Mustafa Suleyman, released three foundational models: MAI-Transcribe-1 (speech-to-text across 25 languages, 2.5x faster than Azure Fast), MAI-Voice-1 (generates 60 seconds of audio in one second with custom voice capability), and MAI-Image-2 (image/video generation). All three are now available on Microsoft Foundry, with transcription and voice models also accessible via MAI Playground. The release follows a renegotiated Microsoft-OpenAI partnership that gave Microsoft greater freedom to pursue independent AI research.
MAI-Transcribe-1's reported 2.5x speedup over Azure Fast is a concrete, measurable claim rather than a vague marketing line, which means you can check it against your own audio. If your stack already lives in Azure, swapping transcription endpoints could cut latency, and possibly cost, in one move. MAI-Voice-1's claimed rate of 60 seconds of audio generated in one second would make it competitive with ElevenLabs and PlayHT on speed, and the custom voice feature opens new product surface area without a third-party dependency.
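To put the two speed figures above in practical terms, here is a minimal arithmetic sketch. The function names and the 10-second baseline job are hypothetical and chosen only for illustration; the 2.5x and 60-seconds-in-one-second numbers are the claims quoted above, not independent measurements.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of processing time removed by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical baseline: a transcription job that takes 10 s today.
baseline_seconds = 10.0
claimed_speedup = 2.5  # figure quoted for MAI-Transcribe-1 vs. Azure Fast

new_seconds = baseline_seconds / claimed_speedup
print(f"Projected job time: {new_seconds:.1f} s")
print(f"Time saved: {time_saved_fraction(claimed_speedup):.0%}")

# MAI-Voice-1 claim: 60 s of audio generated in 1 s of processing.
print(f"Real-time factor: {real_time_factor(1.0, 60.0):.4f}")
```

A 2.5x speedup works out to a 60% reduction in processing time, which is the number to compare against whatever latency budget your pipeline has today.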
Try the MAI Playground this week with your highest-volume transcription workload: run a 2-minute audio sample through it and compare latency and accuracy against your current Whisper or Azure Fast setup. If MAI-Transcribe-1 beats both on word error rate without giving up speed, you have a migration case ready.
Go to the MAI Playground at ai.microsoft.com and sign in with a Microsoft account.
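Once you have transcripts from each system, the word error rate comparison suggested above can be scored locally. The following is a minimal sketch, assuming you have a hand-corrected reference transcript and have copied each system's output into a string. The `normalize` and `word_error_rate` helpers and the sample sentences are hypothetical; no MAI, Azure, or Whisper API is called here.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample data; replace with your own reference and system outputs.
reference = "Please send the quarterly report to finance by Friday."
outputs = {
    "system_a": "Please send the quarterly report to finance by Friday",
    "system_b": "Please send a quarterly report to finance Friday",
}

for name, transcript in outputs.items():
    print(f"{name}: WER = {word_error_rate(reference, transcript):.3f}")
```

Scoring every system against the same reference with the same normalization keeps the comparison fair; punctuation and casing differences would otherwise inflate the error rate for some outputs more than others.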