Mistral released Voxtral TTS, an open-source text-to-speech model with 90ms latency, 9-language support, and sub-5-second voice cloning for edge deployment.
Mistral AI released Voxtral TTS, an open-source text-to-speech model built on the Ministral 3B architecture. The model supports 9 languages, clones a voice from under 5 seconds of audio, and achieves a 90ms time-to-first-audio (TTFA) on 500-character inputs. It is designed for edge deployment, running on devices as small as a smartwatch, and is priced at a fraction of the cost of competitors such as ElevenLabs, Deepgram, and OpenAI. The model also supports multimodal input/output (audio, text, and image) for end-to-end agentic pipelines.
Voxtral TTS runs on Ministral 3B, which means it's small enough to self-host on-device and fast enough for real-time voice applications at 90ms TTFA. The open-source release means you can fine-tune it, avoid per-character API pricing, and own the voice pipeline end-to-end. Multimodal input support (audio + text + image) makes it usable as a drop-in output layer for agentic systems.
Pull the Voxtral TTS model from Mistral's repo this week and benchmark TTFA against your current ElevenLabs or OpenAI TTS integration on a 500-character input — if latency is comparable and cost is lower, swap the provider in your staging environment.
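A TTFA benchmark can be sketched provider-agnostically: time how long a streaming synthesis call takes to yield its first audio chunk. The helper below assumes only that each provider's client exposes a chunk iterator; the commented call sites are hypothetical placeholders, not documented SDK methods.

```python
import time
from typing import Iterable


def measure_ttfa(chunks: Iterable[bytes]) -> float:
    """Seconds from call start until the first non-empty audio chunk arrives."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip keep-alive / empty frames
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any audio")


# Hypothetical usage -- substitute each provider's real streaming call:
# ttfa_voxtral = measure_ttfa(voxtral_stream(SAMPLE_500_CHARS))
# ttfa_current = measure_ttfa(current_provider_stream(SAMPLE_500_CHARS))
```

Run both streams with the same 500-character input several times and compare medians; a single call is too noisy to decide a provider swap.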
Install the Mistral client with `pip install mistralai` and set your `MISTRAL_API_KEY` environment variable.