Generate high-quality, natural-sounding speech from text using state-of-the-art neural text-to-speech models. Perfect for accessibility, content creation, voice assistants, and audio content production.

Request Body

text
string
required
Text to convert to speech (up to 10,000 characters per request)
voice
string
default:"neural-female-1"
Voice model to use for speech generation:
Neural Voices (Premium):
  • neural-female-1 - Natural female voice (English)
  • neural-male-1 - Natural male voice (English)
  • neural-female-2 - Warm female voice (English)
  • neural-male-2 - Professional male voice (English)
Multilingual Voices:
  • multilingual-female - Supports 40+ languages
  • multilingual-male - Supports 40+ languages
Character Voices:
  • storyteller - Narrative and audiobook style
  • news-anchor - Professional news reading style
  • conversational - Casual, friendly tone
  • assistant - AI assistant style
language
string
default:"en"
Language code for speech synthesis:
  • en - English (US)
  • en-GB - English (UK)
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • pt - Portuguese
  • ja - Japanese
  • ko - Korean
  • zh - Chinese (Mandarin)
  • ru - Russian
  • ar - Arabic
  • hi - Hindi
  • And 30+ more languages
speed
number
default:"1.0"
Speech speed multiplier (0.25 to 4.0). Values below 1.0 slow down speech, above 1.0 speed it up.
pitch
number
default:"0.0"
Pitch adjustment in semitones (-20 to +20). Negative values lower pitch, positive values raise it.
volume
number
default:"0.0"
Volume adjustment in dB (-20 to +20). Negative values decrease volume, positive values increase it.
emotion
string
default:"neutral"
Emotional tone for the speech:
  • neutral - Balanced, natural tone
  • happy - Upbeat and cheerful
  • sad - Melancholic and somber
  • angry - Intense and forceful
  • excited - Energetic and enthusiastic
  • calm - Peaceful and relaxed
  • whisper - Soft, quiet delivery
  • shouting - Loud, emphatic delivery
outputFormat
string
default:"mp3"
Audio output format:
  • mp3 - MP3 format (good compression, widely supported)
  • wav - WAV format (uncompressed, highest quality)
  • ogg - OGG Vorbis format (open source, good compression)
  • aac - AAC format (high quality, good compression)
  • flac - FLAC format (lossless compression)
sampleRate
integer
default:"44100"
Audio sample rate in Hz:
  • 22050 - Standard quality (smaller files)
  • 44100 - CD quality (recommended)
  • 48000 - Professional quality (larger files)
enableSSML
boolean
default:"false"
Whether to process SSML (Speech Synthesis Markup Language) tags in the text
addPauses
boolean
default:"true"
Whether to automatically add natural pauses at punctuation
pronunciationGuide
array
Custom pronunciations for specific words, given as an array of objects with word and pronunciation keys
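The documented ranges and allowed values can be checked on the client before a request is sent. The following is a minimal sketch, not part of the TensorOne API: validate_tts_payload and its constants are hypothetical names, the values are copied from the parameter list above, and treating an empty text as invalid is an assumption.

```python
# Allowed values and ranges copied from the parameter reference above.
EMOTIONS = {"neutral", "happy", "sad", "angry", "excited", "calm", "whisper", "shouting"}
FORMATS = {"mp3", "wav", "ogg", "aac", "flac"}
SAMPLE_RATES = {22050, 44100, 48000}
RANGES = {"speed": (0.25, 4.0), "pitch": (-20.0, 20.0), "volume": (-20.0, 20.0)}
MAX_CHARS = 10_000


def validate_tts_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        # Assumption: an empty string is treated the same as a missing text
        problems.append("text is required and must be a non-empty string")
    elif len(text) > MAX_CHARS:
        problems.append(f"text has {len(text)} characters; the limit is {MAX_CHARS}")
    # Numeric parameters must fall inside their documented ranges
    for name, (low, high) in RANGES.items():
        value = payload.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low} to {high}")
    # Enumerated parameters must use one of the documented values
    for name, allowed in (("emotion", EMOTIONS), ("outputFormat", FORMATS), ("sampleRate", SAMPLE_RATES)):
        value = payload.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not a documented value")
    return problems


for problem in validate_tts_payload({"text": "Hi", "speed": 5.0, "emotion": "friendly"}):
    print(problem)
```

The voice and language fields are left unchecked here because the reference lists "30+ more languages" without naming them.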

Response

audioUrl
string
URL to download the generated audio file (expires in 24 hours)
audioBase64
string
Base64-encoded audio data (returned for smaller files, if requested)
duration
number
Duration of the generated audio in seconds
wordCount
integer
Number of words in the input text
characterCount
integer
Number of characters processed
audioFormat
string
Format of the generated audio file
fileSize
string
Size of the generated audio file (e.g., “2.3MB”)
voice
string
Voice model used for generation
language
string
Language used for speech synthesis
metadata
object
Generation metadata, such as processingTime, sampleRate, bitRate, and channels
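One way to work with the response fields above is to load them into a typed object. This is a sketch only: SpeechResult and its methods are hypothetical names, and reading "KB" as 1,000 bytes is an assumption, since the reference does not define the unit.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechResult:
    """Typed view of the documented response fields (hypothetical helper)."""
    audio_url: str
    duration: float
    word_count: int
    character_count: int
    audio_format: str
    file_size: str
    voice: str
    language: str
    metadata: dict = field(default_factory=dict)

    @classmethod
    def from_json(cls, data: dict) -> "SpeechResult":
        # Map the camelCase response keys to snake_case attributes
        return cls(
            audio_url=data["audioUrl"],
            duration=data["duration"],
            word_count=data["wordCount"],
            character_count=data["characterCount"],
            audio_format=data["audioFormat"],
            file_size=data["fileSize"],
            voice=data["voice"],
            language=data["language"],
            metadata=data.get("metadata", {}),
        )

    def file_size_bytes(self) -> int:
        """Convert a size string such as "136KB" to bytes (assumes 1KB = 1000 bytes)."""
        units = {"KB": 1000, "MB": 1000**2, "GB": 1000**3}
        for suffix, factor in units.items():
            if self.file_size.upper().endswith(suffix):
                return round(float(self.file_size[: -len(suffix)]) * factor)
        raise ValueError(f"Unrecognized file size: {self.file_size!r}")
```

With a parsed response, SpeechResult.from_json(response.json()).file_size_bytes() gives a number that can be compared against the size of the downloaded file.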

Example

curl -X POST "https://api.tensorone.ai/v2/ai/text-to-speech" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to TensorOne. This is a demonstration of our advanced text-to-speech technology.",
    "voice": "neural-female-1",
    "language": "en",
    "speed": 1.0,
    "emotion": "happy",
    "outputFormat": "mp3",
    "sampleRate": 44100
  }'
{
  "audioUrl": "https://audio.tensorone.ai/generated/speech_abc123.mp3",
  "duration": 8.5,
  "wordCount": 15,
  "characterCount": 95,
  "audioFormat": "mp3",
  "fileSize": "136KB",
  "voice": "neural-female-1",
  "language": "en",
  "metadata": {
    "processingTime": 3.2,
    "sampleRate": 44100,
    "bitRate": 128,
    "channels": 1
  }
}
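The fileSize, duration, and bitRate values in the example response are related by simple arithmetic. The sketch below assumes that bitRate is reported in kilobits per second and that the audio is encoded at a constant bit rate; neither is stated in the reference, and estimate_file_size_kb is a hypothetical helper.

```python
def estimate_file_size_kb(bit_rate_kbps: float, duration_s: float) -> float:
    """Approximate size in kilobytes of constant-bit-rate audio, ignoring container overhead."""
    return bit_rate_kbps * duration_s / 8  # kilobits -> kilobytes


# Values from the example response: 128 kbps for 8.5 seconds
print(estimate_file_size_kb(128, 8.5))  # 136.0
```

Under those assumptions the result matches the "136KB" shown in the example response.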

SSML Support

Use Speech Synthesis Markup Language for advanced speech control:
Python
import requests

# Generate speech with SSML markup
ssml_text = """
<speak>
    <p>Welcome to <emphasis level="strong">TensorOne</emphasis>!</p>
    
    <p>Here's what we can do:</p>
    <break time="500ms"/>
    
    <p><prosody rate="slow">Generate high-quality speech</prosody></p>
    <p><prosody rate="fast">Process text lightning fast</prosody></p>
    <p><prosody pitch="high">With perfect voice control</prosody></p>
    
    <p>
        Visit us at <say-as interpret-as="characters">API</say-as> dot tensorone dot 
        <prosody pitch="low" rate="slow">A-I</prosody>
    </p>
</speak>
"""

ssml_result = requests.post(
    "https://api.tensorone.ai/v2/ai/text-to-speech",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "text": ssml_text,
        "voice": "neural-female-1",
        "language": "en",
        "enableSSML": True,
        "outputFormat": "wav"
    }
)

print(f"SSML Speech URL: {ssml_result.json()['audioUrl']}")
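When SSML is assembled from user-supplied text, characters such as & and < must be escaped or the markup stops being valid XML. The sketch below shows one way to build a simple document and check that it is well-formed before sending it. build_ssml is a hypothetical helper, and only the speak, p, and break tags used in the example above are assumed to be supported.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(paragraphs: list[str], pause_ms: int = 500) -> str:
    """Wrap plain-text paragraphs in <speak>/<p> tags with a pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() converts &, < and > so the text cannot break the markup
    body = pause.join(f"<p>{escape(p)}</p>" for p in paragraphs)
    ssml = f"<speak>{body}</speak>"
    ET.fromstring(ssml)  # Raises ParseError if the markup is not well-formed XML
    return ssml


print(build_ssml(["Q&A time", "2 < 3"]))
```

The returned string can be passed as the text field with enableSSML set to True.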

Batch Text-to-Speech

Generate multiple audio files by sending one request per text:
Python
import requests


def batch_text_to_speech(texts, voice="neural-female-1"):
    results = []
    
    for i, text in enumerate(texts):
        response = requests.post(
            "https://api.tensorone.ai/v2/ai/text-to-speech",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={
                "text": text,
                "voice": voice,
                "language": "en",
                "outputFormat": "mp3"
            }
        )
        
        result = response.json()
        result['index'] = i
        result['original_text'] = text
        results.append(result)
    
    return results

# Generate multiple audio files
texts = [
    "Welcome to chapter one: Introduction to AI.",
    "Chapter two covers machine learning basics.",
    "In chapter three, we explore neural networks.",
    "Chapter four discusses practical applications."
]

batch_results = batch_text_to_speech(texts, "storyteller")

for result in batch_results:
    print(f"Chapter {result['index'] + 1}: {result['audioUrl']}")
    print(f"Duration: {result['duration']} seconds")
    
    # Download each file
    audio_response = requests.get(result['audioUrl'])
    filename = f"chapter_{result['index'] + 1}.mp3"
    with open(filename, 'wb') as f:
        f.write(audio_response.content)
    
    print(f"Saved: {filename}")
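Because each text is a separate request, the loop above spends most of its time waiting on the network. A thread pool is one way to overlap those waits. In this sketch, batch_concurrently is a hypothetical helper, synthesize stands for any function that sends one request and returns the parsed response, and max_workers=4 is an arbitrary choice because the reference does not document rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def batch_concurrently(
    texts: list[str],
    synthesize: Callable[[str], dict],
    max_workers: int = 4,
) -> list[dict]:
    """Call synthesize(text) for each text in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(synthesize, texts))  # map() preserves input order
    for index, (text, result) in enumerate(zip(texts, results)):
        result["index"] = index
        result["original_text"] = text
    return results


# Stand-in for a real API call, so the example runs without network access
def fake_synthesize(text: str) -> dict:
    return {"audioUrl": f"https://example.invalid/{len(text)}.mp3"}


for item in batch_concurrently(["Chapter one.", "Chapter two, longer."], fake_synthesize):
    print(item["index"], item["audioUrl"])
```

In real use, synthesize would wrap the requests.post call shown above for a single text.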

Multilingual Speech Generation

Generate speech in multiple languages:
Python
import requests


def multilingual_speech(text_translations):
    """
    Generate speech for multiple language versions of the same text
    text_translations: dict with language codes as keys, text as values
    """
    results = {}
    
    for lang_code, text in text_translations.items():
        response = requests.post(
            "https://api.tensorone.ai/v2/ai/text-to-speech",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={
                "text": text,
                "voice": "multilingual-female",
                "language": lang_code,
                "speed": 1.0,
                "emotion": "neutral",
                "outputFormat": "mp3"
            }
        )
        
        results[lang_code] = response.json()
    
    return results

# Generate the same message in multiple languages
translations = {
    "en": "Welcome to our platform! We're excited to have you here.",
    "es": "¡Bienvenido a nuestra plataforma! Estamos emocionados de tenerte aquí.",
    "fr": "Bienvenue sur notre plateforme ! Nous sommes ravis de vous avoir ici.",
    "de": "Willkommen auf unserer Plattform! Wir freuen uns, Sie hier zu haben.",
    "ja": "私たちのプラットフォームへようこそ!あなたがここにいることを嬉しく思います。"
}

multilingual_results = multilingual_speech(translations)

for lang, result in multilingual_results.items():
    print(f"{lang.upper()}: {result['audioUrl']}")
    
    # Download each language version
    audio_response = requests.get(result['audioUrl'])
    with open(f'welcome_{lang}.mp3', 'wb') as f:
        f.write(audio_response.content)
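The voice list marks the four neural voices as English, so a caller that handles several languages may want to fall back to a multilingual voice automatically. The sketch below is one way to do that. pick_voice is a hypothetical helper, and whether the API rejects or degrades a neural voice used with another language is not stated in this reference.

```python
# Voices listed as "(English)" in the voice parameter reference
ENGLISH_ONLY_VOICES = {"neural-female-1", "neural-male-1", "neural-female-2", "neural-male-2"}


def pick_voice(
    language: str,
    preferred: str = "neural-female-1",
    fallback: str = "multilingual-female",
) -> str:
    """Use the preferred voice for English, otherwise fall back to a multilingual voice."""
    is_english = language == "en" or language.startswith("en-")
    if preferred in ENGLISH_ONLY_VOICES and not is_english:
        return fallback
    return preferred


for code in ("en", "en-GB", "ja"):
    print(code, pick_voice(code))
```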

Voice Customization

Fine-tune voice characteristics:
Python
import requests


def custom_voice_speech(text, voice_settings):
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": voice_settings.get("voice", "neural-female-1"),
            "language": voice_settings.get("language", "en"),
            "speed": voice_settings.get("speed", 1.0),
            "pitch": voice_settings.get("pitch", 0.0),
            "volume": voice_settings.get("volume", 0.0),
            "emotion": voice_settings.get("emotion", "neutral"),
            "outputFormat": "wav",
            "sampleRate": 48000
        }
    )
    return response.json()

# Create different character voices
character_voices = {
    "narrator": {
        "voice": "storyteller",
        "speed": 0.9,
        "pitch": -2.0,
        "emotion": "calm"
    },
    "hero": {
        "voice": "neural-male-1",
        "speed": 1.1,
        "pitch": 3.0,
        "emotion": "excited"
    },
    "villain": {
        "voice": "neural-male-2",
        "speed": 0.8,
        "pitch": -5.0,
        "emotion": "angry"
    }
}

story_text = "The brave knight approached the dark castle, ready for battle."

for character, settings in character_voices.items():
    result = custom_voice_speech(story_text, settings)
    print(f"{character.title()}: {result['audioUrl']}")
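The pitch and volume parameters are expressed in semitones and decibels, which are logarithmic units. The sketch below converts them to linear ratios using the standard definitions (12 semitones per octave, 20 dB per tenfold change in amplitude). The function names are hypothetical, and how the service applies these adjustments internally is not documented here.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift; +12 semitones doubles the frequency."""
    return 2 ** (semitones / 12)


def db_to_gain(db: float) -> float:
    """Amplitude ratio for a volume change; +20 dB is a tenfold increase."""
    return 10 ** (db / 20)


# The "villain" preset above lowers pitch by 5 semitones
print(f"pitch -5 semitones -> frequency x{semitones_to_ratio(-5.0):.3f}")
print(f"volume +6 dB -> amplitude x{db_to_gain(6.0):.3f}")
```

By this measure the villain's -5 semitones lowers the fundamental frequency to about three quarters of the original.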

Pronunciation Customization

Control pronunciation of specific words:
Python
import requests


def speech_with_pronunciation(text, pronunciations):
    pronunciation_guide = [
        {"word": word, "pronunciation": pronunciation}
        for word, pronunciation in pronunciations.items()
    ]
    
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": "neural-female-1",
            "language": "en",
            "pronunciationGuide": pronunciation_guide,
            "outputFormat": "mp3"
        }
    )
    return response.json()

# Technical text with custom pronunciations
tech_text = "The TensorOne API uses OAuth authentication and RESTful endpoints."

custom_pronunciations = {
    "TensorOne": "TEN-sor-wun",
    "API": "A-P-I",
    "OAuth": "OH-auth",
    "RESTful": "REST-ful"
}

result = speech_with_pronunciation(tech_text, custom_pronunciations)
print(f"Technical Speech: {result['audioUrl']}")
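A shared pronunciation dictionary tends to accumulate entries that do not occur in a given text. The sketch below builds the guide only from words that actually appear and reports the rest. split_pronunciations is a hypothetical helper, and the whole-word, case-sensitive matching is an assumption; the reference does not describe how the service matches guide entries.

```python
import re


def split_pronunciations(text: str, pronunciations: dict[str, str]) -> tuple[list[dict], list[str]]:
    """Build the guide for words present in the text and list the ones that never occur."""
    guide, unused = [], []
    for word, pronunciation in pronunciations.items():
        # Whole-word, case-sensitive match, so "API" does not match "APIs" or "api"
        if re.search(rf"(?<!\w){re.escape(word)}(?!\w)", text):
            guide.append({"word": word, "pronunciation": pronunciation})
        else:
            unused.append(word)
    return guide, unused


guide, unused = split_pronunciations(
    "The TensorOne API uses OAuth authentication.",
    {"API": "A-P-I", "GraphQL": "graf-Q-L"},
)
print(guide)
print("Not found in text:", unused)
```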

Real-time Streaming

Stream audio as it’s generated (for long texts):
Python
import requests


def streaming_text_to_speech(text, voice="neural-female-1"):
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech/stream",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": voice,
            "language": "en",
            "outputFormat": "mp3",
            "streamChunkSize": 1024  # Bytes per chunk
        },
        stream=True
    )
    
    # Fail early rather than saving an error body as audio
    response.raise_for_status()

    # Save streamed audio
    with open('streamed_speech.mp3', 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                print(".", end="", flush=True)  # Progress indicator
    
    print("\nStreaming complete!")

# Stream long text
long_text = """
This is a long text that will be converted to speech using streaming.
The audio will be generated and streamed back in real-time, allowing
for immediate playback even before the entire text is processed.
This is particularly useful for long documents, articles, or books
where waiting for the complete audio file would take too long.
"""

streaming_text_to_speech(long_text, "storyteller")
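The write loop above can be separated from the HTTP call so that the same code serves a file, an in-memory buffer, or a player's input pipe. The sketch below shows that split. save_stream is a hypothetical helper that accepts any iterable of byte chunks, such as response.iter_content(chunk_size=1024).

```python
import io
from typing import BinaryIO, Iterable


def save_stream(chunks: Iterable[bytes], out: BinaryIO) -> int:
    """Write non-empty chunks to out and return the total number of bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # Skip empty keep-alive chunks
            out.write(chunk)
            total += len(chunk)
    return total


# In-memory demonstration; with requests, pass response.iter_content(chunk_size=1024)
buffer = io.BytesIO()
written = save_stream([b"ID3", b"", b"\x00\x01"], buffer)
print(written, "bytes written")
```

Returning the byte count makes it easy to detect an empty stream.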

Audio Post-Processing

Apply effects and enhancements to generated speech:
Python
import requests


def enhanced_text_to_speech(text, effects=None):
    if effects is None:
        effects = {}
    
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech/enhanced",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": "neural-female-1",
            "language": "en",
            "outputFormat": "wav",
            "audioEffects": {
                "reverb": effects.get("reverb", False),
                "echo": effects.get("echo", False),
                "normalize": effects.get("normalize", True),
                "noiseReduction": effects.get("noise_reduction", True),
                "compressor": effects.get("compressor", False)
            },
            "backgroundMusic": effects.get("background_music"),
            "fadeIn": effects.get("fade_in", 0.5),
            "fadeOut": effects.get("fade_out", 0.5)
        }
    )
    return response.json()

# Generate speech with audio effects
podcast_text = "Welcome to our weekly tech podcast. Today we'll discuss the latest in AI."

podcast_effects = {
    "reverb": True,
    "normalize": True,
    "noise_reduction": True,
    "compressor": True,
    "background_music": "ambient-tech",
    "fade_in": 1.0,
    "fade_out": 2.0
}

enhanced_result = enhanced_text_to_speech(podcast_text, podcast_effects)
print(f"Enhanced Speech: {enhanced_result['audioUrl']}")

Use Cases

Content Creation

  • Podcasts: Generate consistent voice for episodes
  • Audiobooks: Narrate books with different character voices
  • Video Narration: Create voiceovers for videos and presentations
  • Social Media: Generate audio content for platforms

Accessibility

  • Screen Readers: Convert text to speech for visually impaired users
  • Learning Disabilities: Help users with reading difficulties
  • Language Learning: Provide pronunciation examples
  • Elderly Care: Read news, messages, and books aloud

Business Applications

  • Call Centers: Automated voice responses and IVR
  • E-learning: Generate course narration and tutorials
  • Announcements: Create public address and notification systems
  • Marketing: Voice ads and promotional content

Entertainment

  • Gaming: Character voices and narration
  • Interactive Stories: Dynamic story narration
  • Virtual Assistants: Personalized AI assistant voices
  • Audio Drama: Multi-character voice production

Voice Quality Comparison

Different voice models offer varying quality levels:
Voice Type     Quality    Speed    Use Case
Neural         Highest    Slower   Professional content
Standard       Good       Fast     General applications
Multilingual   Good       Medium   International content
Character      Variable   Medium   Entertainment, storytelling

Best Practices

Text Preparation

  • Clean Text: Remove special characters that don’t translate to speech
  • Punctuation: Use proper punctuation for natural pauses
  • Abbreviations: Spell out abbreviations or use pronunciation guide
  • Numbers: Consider how numbers should be spoken (digits vs. words)
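The text preparation steps above can be partly automated. The following is a minimal sketch, assuming a small illustrative abbreviation table and a conservative set of symbols to strip. prepare_text and ABBREVIATIONS are hypothetical, and the right rules depend on the content.

```python
import re

# Illustrative only; extend with the abbreviations that occur in your content
ABBREVIATIONS = {"e.g.": "for example", "i.e.": "that is", "vs.": "versus"}


def prepare_text(text: str, abbreviations: dict[str, str] = ABBREVIATIONS) -> str:
    """Expand abbreviations, drop markup-style symbols, and normalize whitespace."""
    for short, spoken in abbreviations.items():
        text = text.replace(short, spoken)
    text = re.sub(r"[*_#~`|<>]", "", text)  # Symbols that are rarely meant to be spoken
    text = re.sub(r"\s+", " ", text)        # Collapse runs of whitespace and newlines
    return text.strip()


print(prepare_text("  **Digits**  vs.   words,\n e.g. 42 "))
# Digits versus words, for example 42
```

Do not apply this cleanup to SSML input, since it removes the angle brackets that SSML tags depend on.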

Voice Selection

  • Match Content: Choose appropriate voice for content type
  • Consistency: Use same voice for related content
  • Audience: Consider target audience preferences
  • Language: Ensure voice supports the content language

Quality Optimization

  • Sample Rate: Use 44.1kHz for general use, 48kHz for professional
  • Format: WAV for highest quality, MP3 for smaller files
  • Emotion: Match emotional tone to content
  • Speed: Adjust for content type (slower for technical, faster for casual)

Pricing

  • Standard Voices: $0.15 per 1K characters
  • Neural Voices: $0.25 per 1K characters
  • Multilingual: $0.20 per 1K characters
  • Custom Voices: $0.35 per 1K characters
  • Streaming: Additional $0.05 per 1K characters
  • Audio Effects: Additional $0.10 per minute of audio
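The rates above can be combined into a rough per-request estimate. This sketch assumes pro-rata billing with no rounding or minimum charge, which the pricing list does not confirm. The tier names and estimate_cost are hypothetical, and the list does not say which tier the character voices fall under.

```python
from decimal import Decimal

# USD per 1,000 characters, copied from the pricing list above
RATE_PER_1K_CHARS = {
    "standard": Decimal("0.15"),
    "neural": Decimal("0.25"),
    "multilingual": Decimal("0.20"),
    "custom": Decimal("0.35"),
}
STREAMING_PER_1K_CHARS = Decimal("0.05")
EFFECTS_PER_MINUTE = Decimal("0.10")


def estimate_cost(
    characters: int,
    tier: str,
    streaming: bool = False,
    effects_minutes: float = 0.0,
) -> Decimal:
    """Estimate the USD cost of one request, assuming pro-rata billing."""
    thousands = Decimal(characters) / 1000
    cost = thousands * RATE_PER_1K_CHARS[tier]
    if streaming:
        cost += thousands * STREAMING_PER_1K_CHARS
    # Audio effects are billed per minute of audio rather than per character
    cost += Decimal(str(effects_minutes)) * EFFECTS_PER_MINUTE
    return cost


print(estimate_cost(10_000, "neural"))                  # 2.50 plus zero effects cost
print(estimate_cost(10_000, "neural", streaming=True))  # adds 0.50 for streaming
```

Decimal is used instead of float so that small per-request amounts add up without binary rounding error.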

Limitations

  • Character Limit: 10,000 characters per request
  • Processing Time: 1-10 seconds depending on text length and voice complexity
  • File Expiration: Generated audio URLs expire after 24 hours
  • Language Support: Quality varies by language, best for English
Note: generated audio files are automatically deleted after 30 days, and their download URLs expire after 24 hours, so download any files you need to keep long-term.
Tip: for technical content, use the pronunciation guide to ensure accurate pronunciation of specialized terms, brand names, and jargon.
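Texts longer than the 10,000-character limit listed under Limitations have to be split across several requests. The sketch below splits on sentence boundaries where it can and hard-splits only when a single sentence exceeds the limit. split_text is a hypothetical helper, and the sentence pattern is a simple heuristic that will not handle every abbreviation or language.

```python
import re

MAX_CHARS = 10_000  # Per-request limit from the Limitations list


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate  # The sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", limit=9))  # ['One. Two.', 'Three.']
```

Each chunk can then be sent as its own request and the resulting audio files concatenated in order.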