Overview
Google Text-to-Speech AI, delivered through Cloud Text-to-Speech, is Google's API platform for turning text into natural-sounding speech. Google's product page now pairs classic Cloud TTS positioning with newer Gemini TTS, Chirp 3 HD voices, and instant custom voice features, which makes the service broader than a basic “text in, MP3 out” voice API.
That broadening matters. This is not just a legacy speech-synthesis endpoint with fixed voices. Google is now framing the product around several voice-generation paths: promptable Gemini TTS for style and emotion control, Chirp 3 HD for high-fidelity conversational output, instant custom voice for brand or character voice creation, and older Standard, WaveNet, Neural2, and Studio voice families that still matter for cost-sensitive production workloads.
As of April 24, 2026, Google highlights 380+ voices across 75+ languages and variants, SSML and prompt-based control, REST and gRPC APIs, streaming audio synthesis, long-audio synthesis for up to 1 million bytes of input, and a mix of token-based Gemini pricing plus character-based legacy pricing. For most buyers, the main decision is not whether Google can synthesize speech at all, but which voice family and pricing model best fit their latency, realism, brand, and budget requirements.
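The long-audio limit is stated in bytes, not characters, so it can help to check the encoded size of an input before choosing an endpoint. The sketch below is a hypothetical routing helper, not part of any Google client library. The 1,000,000-byte long-audio ceiling comes from the figure above, while the 5,000-byte synchronous limit is an assumption that should be checked against Google's current quota documentation.

```python
# Hypothetical helper: pick a synthesis path from the encoded input size.
LONG_AUDIO_MAX_BYTES = 1_000_000  # long-audio input limit cited on the product page
SYNC_MAX_BYTES = 5_000  # ASSUMPTION: verify against current Cloud TTS quotas


def choose_synthesis_path(text: str, sync_max_bytes: int = SYNC_MAX_BYTES) -> str:
    """Return "synchronous", "long_audio", or "too_large" for plain text or SSML."""
    size = len(text.encode("utf-8"))  # limits are counted in bytes, not characters
    if size <= sync_max_bytes:
        return "synchronous"
    if size <= LONG_AUDIO_MAX_BYTES:
        return "long_audio"
    return "too_large"
```

Multi-byte characters count more than once, so 3,000 accented characters (6,000 UTF-8 bytes) already exceed the assumed synchronous limit even though the character count looks small.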
Key Features
380+ voices across 75+ languages and variants — Google explicitly markets the current product around broad language and voice coverage, which makes it viable for global apps, contact centers, accessibility, and multi-market content.
Gemini TTS for promptable speech generation — The product page highlights Gemini TTS as a newer generation of speech synthesis where developers can guide style, accent, speed, tone, and emotional expression with natural-language prompts.
Chirp 3 HD for higher-fidelity conversational speech — Google positions Chirp 3 HD around more lifelike conversational output, lower-latency streaming, and richer expressiveness for customer-service and voice-agent scenarios.
Instant custom voice, with allow-listed access — Google says Chirp 3 instant custom voice can build a personalized voice model from roughly 10 seconds of audio, but the documentation says access is restricted to allow-listed users and requires contacting sales.
Flexible synthesis controls — Depending on the model, developers can use plain text, SSML, and prompt-driven instructions to control pronunciation, pauses, date and number formatting, delivery style, and emotional tone.
Streaming, long-audio, and API delivery — The current page highlights streaming audio synthesis for low-latency responses, long-audio synthesis for larger asynchronous jobs with documented input limits, and integration through REST and gRPC APIs for devices, apps, and backend services.
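To illustrate the SSML-over-REST path, the sketch below builds a request body for the v1 `text:synthesize` REST method and inserts SSML `<break>` pauses between sentences. The helper names and the `en-US-Standard-C` voice are illustrative assumptions (confirm available names with the voices list), and authentication and the HTTP call itself are omitted.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences: list[str], pause_ms: int = 400) -> str:
    """Wrap sentences in <speak>, separated by SSML <break> pauses."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() turns &, <, > into XML entities so user text cannot break the markup
    return "<speak>" + pause.join(escape(s) for s in sentences) + "</speak>"


def build_synthesize_request(
    ssml: str,
    language_code: str = "en-US",
    voice_name: str = "en-US-Standard-C",  # illustrative; confirm via the voices list
    audio_encoding: str = "MP3",
) -> dict:
    """JSON body for POST https://texttospeech.googleapis.com/v1/text:synthesize."""
    return {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": audio_encoding},
    }
```

Serialized to JSON, this body would be POSTed to the endpoint in the docstring; the response returns the audio as a base64-encoded `audioContent` field. Escaping sentence text keeps characters such as `&` or `<` from producing invalid SSML.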
Pricing & Plans
Google AI Speech (Cloud Text-to-Speech) no longer uses a single pricing model. The current pricing page splits costs across token-priced Gemini TTS models, newer LLM-based or premium voice families billed per character, and older legacy voices that are also billed per character.
| Voice family | Published pricing | Positioning |
|---|---|---|
| Gemini 2.5 Flash TTS / Flash-Lite Preview TTS | No free tier; from $0.50 per 1 million text input tokens and $10 per 1 million audio output tokens | Lowest-cost Gemini TTS entry for promptable speech generation |
| Gemini 3.1 Flash TTS (Preview) | No free tier; $1 per 1 million text input tokens and $20 per 1 million audio output tokens | Newer Gemini preview model with higher output pricing |
| Gemini 2.5 Pro TTS | No free tier; $1 per 1 million text input tokens and $20 per 1 million audio output tokens | Higher-tier Gemini speech generation |
| Chirp 3 HD voices | First 1 million characters free, then $30 per 1 million characters | Premium high-fidelity speech |
| Instant custom voice | No free tier; $60 per 1 million characters | Personalized custom voice output; access is restricted to allow-listed users and requires contacting sales |
| Legacy Standard / WaveNet voices | First 4 million characters free, then $4 per 1 million characters | Lowest published recurring paid price for older voice families |
| Neural2 / Polyglot Preview voices | First 1 million characters free, then $16 per 1 million characters | Mid-tier modern voice families |
| Studio voices | First 1 million characters free, then $160 per 1 million characters | Premium long-form or higher-end voice output |
The practical implication is that “Google AI Speech pricing” varies widely depending on whether you want the newest promptable Gemini voices, premium conversational quality, or the cheapest legacy synthesis path. For cost-sensitive product teams, Standard and WaveNet voices still set the price floor. For modern voice-agent and branded-voice experiences, the choice between Gemini TTS and Chirp 3 usually matters more than the legacy free tiers.
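To make that comparison concrete, the following sketch estimates one billing period of character-priced usage with the rates and free allowances from the table above. It assumes each free allowance resets every billing period and applies per voice family, and it ignores taxes, rounding, and future price changes. The family keys are hypothetical labels.

```python
# Hypothetical cost estimator using the character-priced rows of the pricing table.
# Assumption: free allowances reset each billing period and apply per voice family.

# family label -> (free characters per period, USD per 1M characters after that)
CHARACTER_PRICING: dict[str, tuple[int, float]] = {
    "standard_wavenet": (4_000_000, 4.0),
    "neural2_polyglot": (1_000_000, 16.0),
    "chirp3_hd": (1_000_000, 30.0),
    "instant_custom_voice": (0, 60.0),
    "studio": (1_000_000, 160.0),
}


def character_cost(family: str, characters: int) -> float:
    """Estimated USD cost for `characters` billed characters in one period."""
    free_chars, usd_per_million = CHARACTER_PRICING[family]
    billable = max(0, characters - free_chars)  # free allowance is consumed first
    return billable / 1_000_000 * usd_per_million


if __name__ == "__main__":
    monthly_chars = 10_000_000
    for family in CHARACTER_PRICING:
        print(f"{family:>22}: ${character_cost(family, monthly_chars):,.2f}")
```

Under these assumptions, 10 million characters costs $24 on Standard or WaveNet voices versus $270 on Chirp 3 HD, which is why the voice-family choice usually dominates the budget.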
Best For
- Teams building multilingual voice interfaces on Google Cloud
- Contact-center and voice-agent products that need more natural synthetic speech
- Accessibility, audiobook, and read-aloud workflows
- Apps that need both cheap baseline TTS and a path to higher-end premium voices
- Organizations exploring branded or custom synthetic voices
FAQ
What is Google AI Speech?
In this context it refers to Google's Cloud Text-to-Speech product, which converts text into natural-sounding speech through APIs and now includes Gemini TTS, Chirp 3 voice families, and custom voice options.
How many voices and languages does Google offer?
Google currently markets the product around 380+ voices across 75+ languages and language variants.
Does Google AI Speech support custom voices?
Yes, but availability needs qualification. Google highlights Chirp 3 instant custom voice and broader custom voice options, while the instant custom voice documentation says access is restricted to allow-listed users and requires contacting sales.
Is there a free tier?
Yes, for some voice families. Legacy Standard and WaveNet voices currently include the first 4 million characters free each month, while Neural2, Polyglot, Chirp 3 HD, and Studio voices list a smaller allowance of the first 1 million characters. Gemini TTS and instant custom voice pricing currently show no free tier.
What is the cheapest paid option?
Based on Google's published pricing, the lowest listed recurring paid rate is for legacy Standard and WaveNet voices at $4 per 1 million characters after the free allowance.
Is Gemini TTS priced the same way as older Google TTS voices?
No. Gemini TTS uses token-based pricing with separate charges for text input tokens and audio output tokens, while older voice families are generally billed by character count.
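Token-based billing uses two meters instead of one. The sketch below applies the Gemini TTS rates from the pricing table. It takes token counts as inputs because the conversion from text and audio duration to tokens is not covered here, and the model keys are hypothetical labels rather than API model IDs. With no free tier, cost starts at the first token.

```python
# Hypothetical Gemini TTS cost estimator using the token rates from the pricing table.
# Keys are illustrative labels, not API model IDs. No free tier is applied.

# label -> (USD per 1M text input tokens, USD per 1M audio output tokens)
GEMINI_TTS_PRICING: dict[str, tuple[float, float]] = {
    "2.5_flash": (0.50, 10.0),  # listed as starting ("from") rates
    "3.1_flash_preview": (1.0, 20.0),
    "2.5_pro": (1.0, 20.0),
}


def gemini_tts_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost: text input and audio output are metered separately."""
    in_rate, out_rate = GEMINI_TTS_PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

With 200,000 input tokens and 1 million audio output tokens, Gemini 2.5 Flash TTS comes to about $10.10, of which $10 is output, so the length of the generated audio rather than the prompt drives most of the cost.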