Request Body
Text to convert to speech (up to 10,000 characters per request)
Voice model to use for speech generation:
Neural Voices (Premium):
- neural-female-1: Natural female voice (English)
- neural-male-1: Natural male voice (English)
- neural-female-2: Warm female voice (English)
- neural-male-2: Professional male voice (English)

- multilingual-female: Supports 40+ languages
- multilingual-male: Supports 40+ languages

- storyteller: Narrative and audiobook style
- news-anchor: Professional news reading style
- conversational: Casual, friendly tone
- assistant: AI assistant style
Language code for speech synthesis:
- en: English (US)
- en-GB: English (UK)
- es: Spanish
- fr: French
- de: German
- it: Italian
- pt: Portuguese
- ja: Japanese
- ko: Korean
- zh: Chinese (Mandarin)
- ru: Russian
- ar: Arabic
- hi: Hindi
- And 30+ more languages
Speech speed multiplier (0.25 to 4.0). Values below 1.0 slow down speech, above 1.0 speed it up.
Pitch adjustment in semitones (-20 to +20). Negative values lower pitch, positive values raise it.
Volume adjustment in dB (-20 to +20). Negative values decrease volume, positive values increase it.
Emotional tone for the speech:
- neutral: Balanced, natural tone
- happy: Upbeat and cheerful
- sad: Melancholic and somber
- angry: Intense and forceful
- excited: Energetic and enthusiastic
- calm: Peaceful and relaxed
- whisper: Soft, quiet delivery
- shouting: Loud, emphatic delivery
Audio output format:
- mp3: MP3 format (good compression, widely supported)
- wav: WAV format (uncompressed, highest quality)
- ogg: OGG Vorbis format (open source, good compression)
- aac: AAC format (high quality, good compression)
- flac: FLAC format (lossless compression)
Audio sample rate in Hz:
- 22050: Standard quality (smaller files)
- 44100: CD quality (recommended)
- 48000: Professional quality (larger files)
Whether to process SSML (Speech Synthesis Markup Language) tags in the text
Whether to automatically add natural pauses at punctuation
Custom pronunciation for specific words
Response
URL to download the generated audio file (expires in 24 hours)
Base64-encoded audio data (returned on request, for smaller files)
Duration of the generated audio in seconds
Number of words in the input text
Number of characters processed
Format of the generated audio file
Size of the generated audio file (e.g., “2.3MB”)
Voice model used for generation
Language used for speech synthesis
Generation metadata
Example
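A minimal sketch of assembling a request body and checking it against the documented limits before sending it. The field names (`text`, `voice`, `language`, `format`, `sample_rate`, `speed`, `pitch`, `volume`) and the helper `build_tts_request` are assumptions for illustration only; this page does not show the actual parameter names or the endpoint, so adjust them to match the real API.

```python
import json

MAX_CHARS = 10_000                      # documented per-request character limit
SAMPLE_RATES = (22050, 44100, 48000)    # documented sample rates in Hz

# Documented numeric ranges. The key names are hypothetical field names.
RANGES = {
    "speed": (0.25, 4.0),   # speed multiplier
    "pitch": (-20, 20),     # semitones
    "volume": (-20, 20),    # dB
}


def build_tts_request(text, voice="neural-female-1", language="en",
                      audio_format="mp3", sample_rate=44100, **adjustments):
    """Return a JSON-serializable request body; raise ValueError on bad input."""
    if not text:
        raise ValueError("text must not be empty")
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {SAMPLE_RATES}")

    body = {
        "text": text,
        "voice": voice,
        "language": language,
        "format": audio_format,
        "sample_rate": sample_rate,
    }
    for name, value in adjustments.items():
        if name not in RANGES:
            raise ValueError(f"unknown adjustment: {name}")
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        body[name] = value
    return body


if __name__ == "__main__":
    print(json.dumps(build_tts_request("Hello, world!", speed=1.25), indent=2))
```

Validating on the client side avoids spending a request on input the service would reject, such as a speed of 5.0 or text over the character limit.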
SSML Support
Use Speech Synthesis Markup Language (SSML) for advanced speech control.
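As a rough illustration, the sketch below builds an SSML string from plain sentences using elements from the W3C SSML specification (`<speak>`, `<s>`, `<break>`, `<prosody>`). Which subset of SSML this API accepts is not stated on this page, and `to_ssml` is a hypothetical helper, not part of any SDK.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(sentences, pause_ms=400, rate=None):
    """Wrap sentences in <speak>, separated by <break> pauses.

    Sentence text is XML-escaped so characters such as & and < cannot
    break the markup. If rate is given (for example "slow"), the whole
    body is wrapped in a <prosody> element.
    """
    pause = f'<break time="{int(pause_ms)}ms"/>'
    body = pause.join(f"<s>{escape(s)}</s>" for s in sentences)
    if rate is not None:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(to_ssml(["Welcome back.", "Terms & conditions apply."], rate="slow"))
```

The resulting string would be sent as the input text with SSML processing enabled.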
Batch Text-to-Speech
Generate multiple audio files in one request.
Multilingual Speech Generation
Generate speech in multiple languages.
Voice Customization
Fine-tune voice characteristics.
Pronunciation Customization
Control pronunciation of specific words.
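The shape of the API's custom pronunciation parameter is not shown on this page. As a client-side alternative (a different technique, not the API feature itself), the sketch below respells terms in the text before it is submitted. The helper `apply_pronunciations` and the example respellings are hypothetical.

```python
import re


def apply_pronunciations(text, pronunciations):
    """Replace whole-word, case-insensitive matches with their respellings."""
    if not pronunciations:
        return text
    # Longest terms first, so a phrase like "SQL Server" wins over "SQL".
    terms = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {term.lower(): spoken for term, spoken in pronunciations.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


if __name__ == "__main__":
    guide = {"nginx": "engine x", "SQL": "sequel"}
    print(apply_pronunciations("Use nginx with SQL.", guide))
```

The lookarounds keep the substitution from firing inside longer words, so "MySQL" is left untouched by an entry for "SQL".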
Real-time Streaming
Stream audio as it’s generated (for long texts).
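How the service delivers streamed audio (chunked HTTP, WebSocket, or something else) is not specified on this page. The sketch below therefore only shows the consuming side: it writes any iterable of byte chunks to a file-like object as the chunks arrive. `save_stream` is a hypothetical helper; with the `requests` library, the chunk source could be `response.iter_content(chunk_size=...)` on a response opened with `stream=True`.

```python
import io
from typing import BinaryIO, Iterable


def save_stream(chunks: Iterable[bytes], out: BinaryIO) -> int:
    """Write audio chunks to `out` as they arrive; return total bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:       # skip empty keep-alive chunks
            continue
        out.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Simulated stream standing in for a real network response.
    fake_stream = (b"\x00" * 1024 for _ in range(3))
    buffer = io.BytesIO()
    print(save_stream(fake_stream, buffer), "bytes written")
```

Writing chunk by chunk keeps memory use flat for long texts, and playback or upload can begin before generation finishes.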
Audio Post-Processing
Apply effects and enhancements to generated speech.
Use Cases
Content Creation
- Podcasts: Generate consistent voice for episodes
- Audiobooks: Narrate books with different character voices
- Video Narration: Create voiceovers for videos and presentations
- Social Media: Generate audio content for platforms
Accessibility
- Screen Readers: Convert text to speech for visually impaired users
- Learning Disabilities: Help users with reading difficulties
- Language Learning: Provide pronunciation examples
- Elderly Care: Read news, messages, and books aloud
Business Applications
- Call Centers: Automated voice responses and IVR
- E-learning: Generate course narration and tutorials
- Announcements: Create public address and notification systems
- Marketing: Voice ads and promotional content
Entertainment
- Gaming: Character voices and narration
- Interactive Stories: Dynamic story narration
- Virtual Assistants: Personalized AI assistant voices
- Audio Drama: Multi-character voice production
Voice Quality Comparison
Different voice models offer varying quality levels:

Voice Type | Quality | Speed | Use Case |
---|---|---|---|
Neural | Highest | Slower | Professional content |
Standard | Good | Fast | General applications |
Multilingual | Good | Medium | International content |
Character | Variable | Medium | Entertainment, storytelling |
Best Practices
Text Preparation
- Clean Text: Remove special characters that don’t translate to speech
- Punctuation: Use proper punctuation for natural pauses
- Abbreviations: Spell out abbreviations or use the pronunciation guide
- Numbers: Consider how numbers should be spoken (digits vs. words)
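The text preparation steps above can be sketched as a small preprocessing function. This is one possible approach under stated assumptions: the set of symbols to strip and the abbreviation table are illustrative choices, and `prepare_text` is a hypothetical helper rather than anything the API provides.

```python
import re


def prepare_text(text, abbreviations=None):
    """Expand abbreviations, drop markup-style symbols, and tidy whitespace."""
    # Expand abbreviations only where they appear as whole tokens.
    for abbr, expansion in (abbreviations or {}).items():
        pattern = rf"(?<!\w){re.escape(abbr)}(?!\w)"
        text = re.sub(pattern, lambda _m, e=expansion: e, text)
    # Replace symbols that have no natural spoken form with spaces.
    text = re.sub(r"[*_#~`|<>{}\[\]\\^]", " ", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "**Dr.** Smith   uses `API`  v2"
    print(prepare_text(raw, {"Dr.": "Doctor", "API": "A P I"}))
```

Sentence punctuation is left intact on purpose, since it drives the natural pauses mentioned above. The angle-bracket stripping would have to be skipped for SSML input.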
Voice Selection
- Match Content: Choose a voice appropriate to the content type
- Consistency: Use the same voice for related content
- Audience: Consider target audience preferences
- Language: Ensure voice supports the content language
Quality Optimization
- Sample Rate: Use 44.1 kHz for general use, 48 kHz for professional work
- Format: WAV for highest quality, MP3 for smaller files
- Emotion: Match emotional tone to content
- Speed: Adjust for content type (slower for technical, faster for casual)
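For uncompressed WAV output, the sample-rate choice is largely a file-size choice. As a rough worked example, the sketch below estimates the raw audio payload size, assuming 16-bit mono PCM. This page does not state the bit depth or channel count, so both are assumptions, and the file header is ignored.

```python
def wav_payload_bytes(duration_s, sample_rate=44100, bit_depth=16, channels=1):
    """Estimate uncompressed PCM size: seconds x rate x bytes/sample x channels."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate * bytes_per_sample * channels)


if __name__ == "__main__":
    for rate in (22050, 44100, 48000):
        size_mb = wav_payload_bytes(60, sample_rate=rate) / 1_000_000
        print(f"{rate} Hz, 60 s: {size_mb:.2f} MB")
```

Under these assumptions, one minute of audio comes to roughly 2.6 MB at 22,050 Hz, 5.3 MB at 44,100 Hz, and 5.8 MB at 48,000 Hz. Compressed formats such as MP3 will be considerably smaller.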
Pricing
- Standard Voices: $0.15 per 1K characters
- Neural Voices: $0.25 per 1K characters
- Multilingual: $0.20 per 1K characters
- Custom Voices: $0.35 per 1K characters
- Streaming: Additional $0.05 per 1K characters
- Audio Effects: Additional $0.10 per minute of audio
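The rates above can be combined into a quick cost estimate. The sketch below assumes that charges are prorated by character rather than rounded up to the next 1K, which this page does not confirm. The tier keys and `estimate_cost` are hypothetical names.

```python
from decimal import Decimal

# Per-1K-character rates from the pricing list; the keys are illustrative.
RATES_PER_1K = {
    "standard": Decimal("0.15"),
    "neural": Decimal("0.25"),
    "multilingual": Decimal("0.20"),
    "custom": Decimal("0.35"),
}
STREAMING_PER_1K = Decimal("0.05")    # surcharge per 1K characters
EFFECTS_PER_MINUTE = Decimal("0.10")  # surcharge per minute of audio


def estimate_cost(num_chars, tier="standard", streaming=False, effects_minutes=0):
    """Return the estimated charge in dollars, rounded to cents."""
    per_1k = RATES_PER_1K[tier]
    if streaming:
        per_1k += STREAMING_PER_1K
    cost = per_1k * Decimal(num_chars) / 1000
    cost += EFFECTS_PER_MINUTE * Decimal(str(effects_minutes))
    return cost.quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(estimate_cost(10_000, "neural"))
    print(estimate_cost(4_000, "standard", streaming=True, effects_minutes=2))
```

`Decimal` is used instead of floats so that sums of cent amounts stay exact. A full 10,000-character request with a neural voice comes to $2.50 under these assumptions.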
Limitations
- Character Limit: 10,000 characters per request
- Processing Time: 1-10 seconds depending on text length and voice complexity
- File Expiration: Generated audio URLs expire after 24 hours
- Language Support: Quality varies by language and is best for English
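Texts longer than the 10,000-character limit have to be split across several requests. One way to do that, sketched below, is to pack whole sentences into chunks so that breaks fall at natural pauses. `split_text` is a hypothetical helper, and the sentence detection is a simple punctuation heuristic, not a full tokenizer.

```python
import re


def split_text(text, limit=10_000):
    """Split text into chunks of at most `limit` characters, on sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(split_text("One. Two. Three.", limit=10))
```

Each chunk can then be submitted as its own request and the resulting audio files concatenated in order.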
Generated audio files are automatically deleted after 30 days. Download important files for long-term storage.
For best results with technical content, use the pronunciation guide feature to ensure accurate pronunciation of specialized terms, brand names, and technical jargon.