Mistral released Voxtral TTS, an open-source text-to-speech model with 90ms latency, 9-language support, and sub-5-second voice cloning for edge deployment.
Mistral AI released Voxtral TTS, an open-source text-to-speech model built on the Ministral 3B architecture. The model supports 9 languages, clones a voice from under 5 seconds of audio, and achieves a 90ms time-to-first-audio (TTFA) on 500-character inputs. It is designed for edge deployment, running on devices as small as a smartwatch, and is priced at a fraction of the cost of competitors such as ElevenLabs, Deepgram, and OpenAI. The model also supports multimodal input/output (audio, text, and image) for end-to-end agentic pipelines.
Voxtral TTS runs on Ministral 3B, which means it's small enough to self-host on-device and fast enough for real-time voice applications at 90ms TTFA. The open-source release means you can fine-tune it, avoid per-character API pricing, and own the voice pipeline end-to-end. Multimodal input support (audio + text + image) makes it usable as a drop-in output layer for agentic systems.
Pull the Voxtral TTS model from Mistral's repo this week and benchmark TTFA against your current ElevenLabs or OpenAI TTS integration on a 500-character input — if latency is comparable and cost is lower, swap the provider in your staging environment.
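A TTFA benchmark can be sketched provider-agnostically: time how long a streaming synthesis call takes to yield its first audio chunk. The helper below assumes only that each provider's client exposes a chunk iterator; the commented call sites are hypothetical placeholders, not documented SDK methods.

```python
import time
from typing import Iterable


def measure_ttfa(chunks: Iterable[bytes]) -> float:
    """Seconds from call start until the first non-empty audio chunk arrives."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip keep-alive / empty frames
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any audio")


# Hypothetical usage -- substitute each provider's real streaming call:
# ttfa_voxtral = measure_ttfa(voxtral_stream(SAMPLE_500_CHARS))
# ttfa_current = measure_ttfa(current_provider_stream(SAMPLE_500_CHARS))
```

Run both streams with the same 500-character input several times and compare medians; a single call is too noisy to decide a provider swap.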
Install the Mistral client with `pip install mistralai` and set your `MISTRAL_API_KEY` environment variable.