DeepMind | Wednesday, April 15, 2026

Google Releases Gemini 3.1 Flash TTS With Audio Tags for Granular Speech Control

Google has introduced Gemini 3.1 Flash TTS, a text-to-speech model designed to give developers and enterprises precise control over AI-generated speech. The model rolls out today in preview for developers via the Gemini API and Google AI Studio, for enterprises on Vertex AI, and for Workspace users through Google Vids.

The core innovation is a system of audio tags: natural language commands embedded directly in the text input that let developers control vocal style, pace, and delivery. Rather than applying speech parameters at the document level, these inline tags can change expression mid-sentence, letting creators direct a character's performance at a fine grain.
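To make the inline-tag idea concrete, the following is a minimal sketch of how tagged input text might be assembled before it is sent to a TTS model. The announcement does not specify the tag syntax, so the square-bracket form, the `Segment` class, and the `render_tagged_text` helper are all hypothetical, and no actual Gemini API call is shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A span of spoken text with an optional inline direction."""
    text: str
    # Hypothetical natural-language direction, e.g. "whispering".
    tag: Optional[str] = None


def render_tagged_text(segments: List[Segment]) -> str:
    """Join segments into one input string, prefixing tagged spans.

    ASSUMPTION: tags are written as "[direction]" before the text they
    affect. The real tag syntax may differ; check the official docs.
    """
    parts = []
    for seg in segments:
        if seg.tag:
            parts.append(f"[{seg.tag}] {seg.text}")
        else:
            parts.append(seg.text)
    return " ".join(parts)


if __name__ == "__main__":
    line = [
        Segment("I checked the results twice,"),
        Segment("and we actually won.", tag="excited"),
    ]
    print(render_tagged_text(line))
```

Because each direction is attached to a span rather than to the whole document, the delivery can shift partway through a sentence, which is the behavior the audio tags are described as enabling.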

Google AI Studio now provides three layers of control. Scene direction lets developers define the environment and dialogue instructions so that characters stay in character across multiple turns. Speaker-level specificity lets developers assign a unique Audio Profile to each character, then use Director's Notes to adjust pace, tone, and accent. Once a performance is finalized, the exact parameters can be exported as Gemini API code to keep voices consistent and recognizable across projects and platforms.

On the Artificial Analysis TTS leaderboard, a benchmark built on thousands of blind human preference votes, Gemini 3.1 Flash TTS achieved an Elo score of 1,211. The leaderboard also placed the model in its "most attractive quadrant" for combining high-quality speech generation with low cost. Google says the result reflects improved overall speech quality and calls the model its most natural and expressive text-to-speech model to date.

The model supports native multi-speaker dialogue and covers 70+ languages, enabling developers to create localized, expressive speech experiences for global audiences. Early testers have highlighted the audio tags as providing a new level of creative precision, transforming simple text into high-fidelity vocal performance.

All audio generated by Gemini 3.1 Flash TTS includes SynthID watermarking—an imperceptible mark embedded directly into audio output that allows reliable detection of AI-generated content to help prevent misinformation.

Source: DeepMind
