Text to Speech

Generate natural-sounding speech from text using neural text-to-speech models. Suited to accessibility, content creation, voice assistants, and audio production.

Request Body

text

string

required

Text to convert to speech (up to 10,000 characters per request)

voice

string

default:"neural-female-1"

Voice model to use for speech generation:

Neural Voices (Premium):

neural-female-1 - Natural female voice (English)
neural-male-1 - Natural male voice (English)
neural-female-2 - Warm female voice (English)
neural-male-2 - Professional male voice (English)

Multilingual Voices:

multilingual-female - Supports 40+ languages
multilingual-male - Supports 40+ languages

Character Voices:

storyteller - Narrative and audiobook style
news-anchor - Professional news reading style
conversational - Casual, friendly tone
assistant - AI assistant style

language

string

default:"en"

Language code for speech synthesis:

en - English (US)
en-GB - English (UK)
es - Spanish
fr - French
de - German
it - Italian
pt - Portuguese
ja - Japanese
ko - Korean
zh - Chinese (Mandarin)
ru - Russian
ar - Arabic
hi - Hindi
And 30+ more languages

speed

number

default:"1.0"

Speech speed multiplier (0.25 to 4.0). Values below 1.0 slow down speech, above 1.0 speed it up.

pitch

number

default:"0.0"

Pitch adjustment in semitones (-20 to +20). Negative values lower pitch, positive values raise it.

volume

number

default:"0.0"

Volume adjustment in dB (-20 to +20). Negative values decrease volume, positive values increase it.
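The pitch and volume units map onto physical ratios through standard formulas: a shift of n semitones multiplies frequency by 2^(n/12), and a gain of g dB multiplies amplitude by 10^(g/20). The sketch below shows both conversions along with a client-side range check. The helper names are hypothetical and not part of the API; only the ranges come from the parameter descriptions.

```python
# Hypothetical client-side helpers; only the ranges are taken from the docs.
RANGES = {
    "speed": (0.25, 4.0),
    "pitch": (-20.0, 20.0),
    "volume": (-20.0, 20.0),
}


def check_range(name: str, value: float) -> float:
    """Return value unchanged, or raise ValueError if it is out of range."""
    low, high = RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio for a gain expressed in decibels."""
    return 10.0 ** (db / 20.0)


# +12 semitones is one octave up; -20 dB is one tenth of the amplitude.
print(semitones_to_ratio(check_range("pitch", 12.0)))
print(db_to_amplitude_ratio(check_range("volume", -20.0)))
```

Checking ranges before sending avoids spending a request on a value the service would presumably reject.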

emotion

string

default:"neutral"

Emotional tone for the speech:

neutral - Balanced, natural tone
happy - Upbeat and cheerful
sad - Melancholic and somber
angry - Intense and forceful
excited - Energetic and enthusiastic
calm - Peaceful and relaxed
whisper - Soft, quiet delivery
shouting - Loud, emphatic delivery

outputFormat

string

default:"mp3"

Audio output format:

mp3 - MP3 format (good compression, widely supported)
wav - WAV format (uncompressed, highest quality)
ogg - OGG Vorbis format (open source, good compression)
aac - AAC format (high quality, good compression)
flac - FLAC format (lossless compression)

sampleRate

integer

default:"44100"

Audio sample rate in Hz:

22050 - Standard quality (smaller files)
44100 - CD quality (recommended)
48000 - Professional quality (larger files)
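As a rough guide to the trade-off, a sample rate of f Hz can represent frequencies up to f/2 (the Nyquist limit), and uncompressed audio grows linearly with the rate. The sketch below assumes 16-bit mono PCM, which this page does not state; the function names are illustrative only.

```python
def nyquist_hz(sample_rate: int) -> float:
    """Highest frequency representable at the given sample rate."""
    return sample_rate / 2


def pcm_bytes(sample_rate: int, seconds: float,
              channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Size of raw PCM audio, excluding any file header.

    channels=1 and bytes_per_sample=2 (16-bit) are assumptions.
    """
    return int(sample_rate * seconds * channels * bytes_per_sample)


# Compare one minute of audio at each documented sample rate.
for rate in (22050, 44100, 48000):
    print(rate, nyquist_hz(rate), pcm_bytes(rate, 60))
```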

enableSSML

boolean

default:"false"

Whether to process SSML (Speech Synthesis Markup Language) tags in the text

addPauses

boolean

default:"true"

Whether to automatically add natural pauses at punctuation

pronunciationGuide

array

Custom pronunciations for specific words. Each entry is an object with the following fields:

word

string

required

Word to customize pronunciation for

pronunciation

string

required

Phonetic pronunciation (IPA or custom format)

Response

audioUrl

string

URL to download the generated audio file (expires in 24 hours)

audioBase64

string

Base64-encoded audio data (returned for smaller files, when requested)
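When audioBase64 is present, the audio can be written to disk without a second download. A minimal sketch using only the standard library follows; the helper name is hypothetical, and the sample bytes stand in for real audio.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_audio(audio_base64: str, path) -> int:
    """Decode Base64 audio, write it to path, and return the byte count."""
    # validate=True rejects characters outside the Base64 alphabet
    data = base64.b64decode(audio_base64, validate=True)
    Path(path).write_bytes(data)
    return len(data)


# Round-trip placeholder bytes so the example runs without calling the API.
sample = base64.b64encode(b"placeholder audio bytes").decode("ascii")
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "speech.mp3"
    written = save_base64_audio(sample, out)
    print(written == out.stat().st_size)
```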

duration

number

Duration of the generated audio in seconds

wordCount

integer

Number of words in the input text

characterCount

integer

Number of characters processed

audioFormat

string

Format of the generated audio file

fileSize

string

Size of the generated audio file (e.g., “2.3MB”)
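Because fileSize is a human-readable string rather than a number, clients that need a byte count have to parse it. The sketch below assumes decimal units (1 KB = 1,000 bytes) and the unit suffixes B, KB, MB, and GB; neither is confirmed by this page.

```python
import re

# Assumed decimal multipliers; the API does not document the unit base.
_UNITS = {"B": 1, "KB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}


def parse_file_size(size: str) -> int:
    """Convert a string such as '2.3MB' into an approximate byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", size,
                         flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized file size: {size!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit.upper()])


print(parse_file_size("2.3MB"))
print(parse_file_size("136KB"))
```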

voice

string

Voice model used for generation

language

string

Language used for speech synthesis

metadata

object

Generation metadata, an object with the following fields:

processingTime

number

Time taken to generate speech in seconds

sampleRate

integer

Audio sample rate used

bitRate

integer

Audio bit rate in kbps

channels

integer

Number of audio channels (1 for mono, 2 for stereo)

Example

curl -X POST "https://api.tensorone.ai/v2/ai/text-to-speech" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to TensorOne. This is a demonstration of our advanced text-to-speech technology.",
    "voice": "neural-female-1",
    "language": "en",
    "speed": 1.0,
    "emotion": "happy",
    "outputFormat": "mp3",
    "sampleRate": 44100
  }'

{
  "audioUrl": "https://audio.tensorone.ai/generated/speech_abc123.mp3",
  "duration": 8.5,
  "wordCount": 15,
  "characterCount": 86,
  "audioFormat": "mp3",
  "fileSize": "136KB",
  "voice": "neural-female-1",
  "language": "en",
  "metadata": {
    "processingTime": 3.2,
    "sampleRate": 44100,
    "bitRate": 128,
    "channels": 1
  }
}
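The figures in this response are mutually consistent: at a constant 128 kbps, 8.5 seconds of audio occupies 128,000 × 8.5 / 8 = 136,000 bytes, which matches the reported "136KB" if KB means 1,000 bytes. The sketch below reproduces that arithmetic. It assumes a constant bit rate and ignores container overhead, so treat it as an estimate.

```python
def estimated_size_bytes(bit_rate_kbps: int, duration_seconds: float) -> int:
    """Approximate encoded size: bits per second times seconds, divided by 8."""
    return round(bit_rate_kbps * 1000 * duration_seconds / 8)


# Values copied from the example response above.
response = {"duration": 8.5, "fileSize": "136KB", "metadata": {"bitRate": 128}}

size = estimated_size_bytes(response["metadata"]["bitRate"], response["duration"])
print(size)  # 136000
```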

SSML Support

Use Speech Synthesis Markup Language for advanced speech control:

Python

import requests

# Generate speech with SSML markup
ssml_text = """
<speak>
    <p>Welcome to <emphasis level="strong">TensorOne</emphasis>!</p>
    
    <p>Here's what we can do:</p>
    <break time="500ms"/>
    
    <p><prosody rate="slow">Generate high-quality speech</prosody></p>
    <p><prosody rate="fast">Process text lightning fast</prosody></p>
    <p><prosody pitch="high">With perfect voice control</prosody></p>
    
    <p>
        Visit us at <say-as interpret-as="characters">API</say-as> dot tensorone dot 
        <prosody pitch="low" rate="slow">A-I</prosody>
    </p>
</speak>
"""

ssml_result = requests.post(
    "https://api.tensorone.ai/v2/ai/text-to-speech",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "text": ssml_text,
        "voice": "neural-female-1",
        "language": "en",
        "enableSSML": True,
        "outputFormat": "wav"
    }
)

print(f"SSML Speech URL: {ssml_result.json()['audioUrl']}")
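SSML is XML, so plain text that contains characters such as & or < needs escaping before it is wrapped in tags. The sketch below builds a simple document from untrusted strings and checks that it parses as XML. The helper names are hypothetical, and the check covers well-formedness only, not whether the service accepts every tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(paragraphs):
    """Wrap each string in <p>, escaping XML special characters."""
    body = "".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return f"<speak>{body}</speak>"


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return True


ssml = build_ssml(["Q&A session", "Use x < y for comparisons"])
print(ssml)
print(is_well_formed(ssml))                         # escaped input parses
print(is_well_formed("<speak><p>Q&A</p></speak>"))  # raw ampersand does not
```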

Batch Text-to-Speech

Generate multiple audio files by sending one request per text:

Python

import requests


def batch_text_to_speech(texts, voice="neural-female-1"):
    results = []
    
    for i, text in enumerate(texts):
        response = requests.post(
            "https://api.tensorone.ai/v2/ai/text-to-speech",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={
                "text": text,
                "voice": voice,
                "language": "en",
                "outputFormat": "mp3"
            }
        )
        
        result = response.json()
        result['index'] = i
        result['original_text'] = text
        results.append(result)
    
    return results

# Generate multiple audio files
texts = [
    "Welcome to chapter one: Introduction to AI.",
    "Chapter two covers machine learning basics.",
    "In chapter three, we explore neural networks.",
    "Chapter four discusses practical applications."
]

batch_results = batch_text_to_speech(texts, "storyteller")

for result in batch_results:
    print(f"Chapter {result['index'] + 1}: {result['audioUrl']}")
    print(f"Duration: {result['duration']} seconds")
    
    # Download each file
    audio_response = requests.get(result['audioUrl'])
    filename = f"chapter_{result['index'] + 1}.mp3"
    with open(filename, 'wb') as f:
        f.write(audio_response.content)
    
    print(f"Saved: {filename}")
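Since each text is an independent request, the calls can also be issued concurrently. The sketch below uses a thread pool and takes the request function as a parameter, so one failed item does not abort the rest. Here `synthesize` is a hypothetical callable that would wrap the POST shown above, and `fake_synthesize` is a stand-in so the example runs offline. Rate limits are not documented on this page, so a small `max_workers` is a cautious assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_concurrent(texts, synthesize, max_workers=4):
    """Call synthesize(text) for each text in parallel, keeping input order."""
    def run_one(indexed):
        index, text = indexed
        try:
            return {"index": index, "ok": True, "result": synthesize(text)}
        except Exception as exc:  # record the failure and keep going
            return {"index": index, "ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(run_one, enumerate(texts)))


def fake_synthesize(text):
    """Stand-in for a function that POSTs the text and returns the JSON body."""
    if not text:
        raise ValueError("text is required")
    return {"audioUrl": f"https://example.invalid/{len(text)}.mp3"}


for item in batch_concurrent(["Chapter one.", "", "Chapter three."], fake_synthesize):
    print(item)
```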

Multilingual Speech Generation

Generate speech in multiple languages:

Python

import requests


def multilingual_speech(text_translations):
    """
    Generate speech for multiple language versions of the same text
    text_translations: dict with language codes as keys, text as values
    """
    results = {}
    
    for lang_code, text in text_translations.items():
        response = requests.post(
            "https://api.tensorone.ai/v2/ai/text-to-speech",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={
                "text": text,
                "voice": "multilingual-female",
                "language": lang_code,
                "speed": 1.0,
                "emotion": "neutral",
                "outputFormat": "mp3"
            }
        )
        
        results[lang_code] = response.json()
    
    return results

# Generate the same message in multiple languages
translations = {
    "en": "Welcome to our platform! We're excited to have you here.",
    "es": "¡Bienvenido a nuestra plataforma! Estamos emocionados de tenerte aquí.",
    "fr": "Bienvenue sur notre plateforme ! Nous sommes ravis de vous avoir ici.",
    "de": "Willkommen auf unserer Plattform! Wir freuen uns, Sie hier zu haben.",
    "ja": "私たちのプラットフォームへようこそ！あなたがここにいることを嬉しく思います。"
}

multilingual_results = multilingual_speech(translations)

for lang, result in multilingual_results.items():
    print(f"{lang.upper()}: {result['audioUrl']}")
    
    # Download each language version
    audio_response = requests.get(result['audioUrl'])
    with open(f'welcome_{lang}.mp3', 'wb') as f:
        f.write(audio_response.content)

Voice Customization

Fine-tune voice characteristics:

Python

import requests


def custom_voice_speech(text, voice_settings):
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": voice_settings.get("voice", "neural-female-1"),
            "language": voice_settings.get("language", "en"),
            "speed": voice_settings.get("speed", 1.0),
            "pitch": voice_settings.get("pitch", 0.0),
            "volume": voice_settings.get("volume", 0.0),
            "emotion": voice_settings.get("emotion", "neutral"),
            "outputFormat": "wav",
            "sampleRate": 48000
        }
    )
    return response.json()

# Create different character voices
character_voices = {
    "narrator": {
        "voice": "storyteller",
        "speed": 0.9,
        "pitch": -2.0,
        "emotion": "calm"
    },
    "hero": {
        "voice": "neural-male-1",
        "speed": 1.1,
        "pitch": 3.0,
        "emotion": "excited"
    },
    "villain": {
        "voice": "neural-male-2",
        "speed": 0.8,
        "pitch": -5.0,
        "emotion": "angry"
    }
}

story_text = "The brave knight approached the dark castle, ready for battle."

for character, settings in character_voices.items():
    result = custom_voice_speech(story_text, settings)
    print(f"{character.title()}: {result['audioUrl']}")

Pronunciation Customization

Control pronunciation of specific words:

Python

import requests


def speech_with_pronunciation(text, pronunciations):
    pronunciation_guide = [
        {"word": word, "pronunciation": pronunciation}
        for word, pronunciation in pronunciations.items()
    ]
    
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": "neural-female-1",
            "language": "en",
            "pronunciationGuide": pronunciation_guide,
            "outputFormat": "mp3"
        }
    )
    return response.json()

# Technical text with custom pronunciations
tech_text = "The TensorOne API uses OAuth authentication and RESTful endpoints."

custom_pronunciations = {
    "TensorOne": "TEN-sor-wun",
    "API": "A-P-I",
    "OAuth": "OH-auth",
    "RESTful": "REST-ful"
}

result = speech_with_pronunciation(tech_text, custom_pronunciations)
print(f"Technical Speech: {result['audioUrl']}")

Real-time Streaming

Stream audio as it’s generated (for long texts):

Python

import requests


def streaming_text_to_speech(text, voice="neural-female-1"):
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech/stream",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": voice,
            "language": "en",
            "outputFormat": "mp3",
            "streamChunkSize": 1024  # Bytes per chunk
        },
        stream=True
    )
    
    # Fail early so an error response is not written to disk as audio
    response.raise_for_status()

    # Save streamed audio
    with open('streamed_speech.mp3', 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                print(".", end="", flush=True)  # Progress indicator
    
    print("\nStreaming complete!")

# Stream long text
long_text = """
This is a long text that will be converted to speech using streaming.
The audio will be generated and streamed back in real-time, allowing
for immediate playback even before the entire text is processed.
This is particularly useful for long documents, articles, or books
where waiting for the complete audio file would take too long.
"""

streaming_text_to_speech(long_text, "storyteller")

Audio Post-Processing

Apply effects and enhancements to generated speech:

Python

import requests


def enhanced_text_to_speech(text, effects=None):
    if effects is None:
        effects = {}
    
    response = requests.post(
        "https://api.tensorone.ai/v2/ai/text-to-speech/enhanced",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": text,
            "voice": "neural-female-1",
            "language": "en",
            "outputFormat": "wav",
            "audioEffects": {
                "reverb": effects.get("reverb", False),
                "echo": effects.get("echo", False),
                "normalize": effects.get("normalize", True),
                "noiseReduction": effects.get("noise_reduction", True),
                "compressor": effects.get("compressor", False)
            },
            "backgroundMusic": effects.get("background_music"),
            "fadeIn": effects.get("fade_in", 0.5),
            "fadeOut": effects.get("fade_out", 0.5)
        }
    )
    return response.json()

# Generate speech with audio effects
podcast_text = "Welcome to our weekly tech podcast. Today we'll discuss the latest in AI."

podcast_effects = {
    "reverb": True,
    "normalize": True,
    "noise_reduction": True,
    "compressor": True,
    "background_music": "ambient-tech",
    "fade_in": 1.0,
    "fade_out": 2.0
}

enhanced_result = enhanced_text_to_speech(podcast_text, podcast_effects)
print(f"Enhanced Speech: {enhanced_result['audioUrl']}")

Use Cases

Content Creation

Podcasts: Generate consistent voice for episodes
Audiobooks: Narrate books with different character voices
Video Narration: Create voiceovers for videos and presentations
Social Media: Generate audio content for platforms

Accessibility

Screen Readers: Convert text to speech for visually impaired users
Learning Disabilities: Help users with reading difficulties
Language Learning: Provide pronunciation examples
Elderly Care: Read news, messages, and books aloud

Business Applications

Call Centers: Automated voice responses and IVR
E-learning: Generate course narration and tutorials
Announcements: Create public address and notification systems
Marketing: Voice ads and promotional content

Entertainment

Gaming: Character voices and narration
Interactive Stories: Dynamic story narration
Virtual Assistants: Personalized AI assistant voices
Audio Drama: Multi-character voice production

Voice Quality Comparison

Different voice models offer varying quality levels:

Voice Type	Quality	Speed	Use Case
Neural	Highest	Slower	Professional content
Standard	Good	Fast	General applications
Multilingual	Good	Medium	International content
Character	Variable	Medium	Entertainment, storytelling

Best Practices

Text Preparation

Clean Text: Remove special characters that don’t translate to speech
Punctuation: Use proper punctuation for natural pauses
Abbreviations: Spell out abbreviations or use pronunciation guide
Numbers: Consider how numbers should be spoken (digits vs. words)
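A minimal sketch of the cleaning step follows, assuming plain-text input (not SSML). The set of stripped symbols is an illustrative choice, not something the API prescribes, and the function name is hypothetical.

```python
import re
import unicodedata


def prepare_text(text: str) -> str:
    """Normalize text before sending it for synthesis."""
    text = unicodedata.normalize("NFC", text)
    # Replace control and invisible format characters (newlines, tabs,
    # zero-width spaces) with a plain space
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    # Drop markup symbols that have no spoken form
    text = re.sub(r"[*#`~]+", "", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()


print(prepare_text("**Hello**,\u200b   world!\n\n# Title"))
```

Abbreviations and numbers are left untouched here; those are better handled through the pronunciation guide or by rewriting the text by hand.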

Voice Selection

Match Content: Choose appropriate voice for content type
Consistency: Use same voice for related content
Audience: Consider target audience preferences
Language: Ensure voice supports the content language

Quality Optimization

Sample Rate: Use 44.1 kHz for general use, 48 kHz for professional production
Format: WAV for highest quality, MP3 for smaller files
Emotion: Match emotional tone to content
Speed: Adjust for content type (slower for technical, faster for casual)

Pricing

Standard Voices: $0.15 per 1K characters
Neural Voices: $0.25 per 1K characters
Multilingual: $0.20 per 1K characters
Custom Voices: $0.35 per 1K characters
Streaming: Additional $0.05 per 1K characters
Audio Effects: Additional $0.10 per minute of audio
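The rates above can be combined into a quick estimate. The sketch below assumes pro-rata billing (no rounding up to the next 1K characters) and that the streaming surcharge simply adds to the per-character rate; neither is stated on this page. The tier keys and function name are hypothetical, and the page does not say which tier the character voices fall under.

```python
from decimal import Decimal

# Per-1K-character rates copied from the pricing list
RATES_PER_1K = {
    "standard": Decimal("0.15"),
    "neural": Decimal("0.25"),
    "multilingual": Decimal("0.20"),
    "custom": Decimal("0.35"),
}
STREAMING_PER_1K = Decimal("0.05")
EFFECTS_PER_MINUTE = Decimal("0.10")


def estimate_cost(characters, tier, streaming=False, effects_minutes=0):
    """Estimate the cost in dollars, rounded to cents."""
    per_1k = RATES_PER_1K[tier]
    if streaming:
        per_1k += STREAMING_PER_1K
    cost = per_1k * Decimal(characters) / 1000
    cost += EFFECTS_PER_MINUTE * Decimal(str(effects_minutes))
    return cost.quantize(Decimal("0.01"))


print(estimate_cost(10_000, "neural"))
print(estimate_cost(10_000, "neural", streaming=True, effects_minutes=2))
```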

Limitations

Character Limit: 10,000 characters per request
Processing Time: 1-10 seconds depending on text length and voice complexity
File Expiration: Generated audio URLs expire after 24 hours
Language Support: Quality varies by language, best for English
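Text longer than the character limit has to be split across several requests. The sketch below splits at sentence boundaries where possible and falls back to a hard cut for a single sentence that exceeds the limit. The function name and the sentence-boundary rule are illustrative assumptions, not part of the API.

```python
import re

MAX_CHARS = 10_000  # per-request limit stated above


def split_text(text: str, limit: int = MAX_CHARS):
    """Split text into chunks of at most limit characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", limit=10))
```

Each chunk can then be sent as its own request, and the resulting audio files concatenated in order.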

Generated audio files are automatically deleted after 30 days. Download important files for long-term storage.

For best results with technical content, use the pronunciation guide feature to ensure accurate pronunciation of specialized terms, brand names, and technical jargon.
