
Experimental Projects

Text to Speech

Our Text to Speech (TTS) initiative focuses on making AI-generated voices more expressive, controllable, and human-like—without sacrificing speed or clarity.

We’re not just fine-tuning existing models. We’re curating datasets, training custom speakers, and experimenting with multi-style, multilingual, and emotionally aware voice synthesis.


Areas of Contribution

1. Custom TTS Models

We experiment with and contribute back to the following architectures (a minimal inference sketch follows the list):

  • VITS / YourTTS: Non-autoregressive synthesis with emotion conditioning
  • Bark: Multilingual, multi-modal speech synthesis with support for music tones and nonverbal sounds
  • StyleTTS2: Style control and speaker embedding mixing
  • Tortoise: Slow but ultra-realistic voice generation for narration
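
For a concrete starting point, here is a minimal inference sketch using the Coqui TTS library (one of the training stacks listed under dataset compatibility below) with a public VITS checkpoint from the Coqui model zoo; the checkpoint is an example, not one of our fine-tunes.

  # Minimal VITS inference via Coqui TTS (pip install TTS).
  # The checkpoint is a public Coqui zoo model used for illustration;
  # swap in a fine-tuned checkpoint for one of our custom speakers.
  from TTS.api import TTS

  tts = TTS(model_name="tts_models/en/ljspeech/vits")
  tts.tts_to_file(
      text="Non-autoregressive synthesis keeps latency low.",
      file_path="demo.wav",
  )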

Our fine-tunes include:

  • Emotion-tagged speaker datasets (e.g., sarcasm, whisper, shout)
  • Instructional tone datasets for professional narration
  • Open-domain conversational samples with prosody variation

2. Voice Cloning + LoRA Adapters

We’ve developed LoRA adapters and speaker embeddings for:

  • Internal team voices (used in agentic pipelines)
  • Fictional character voices for agent role-play
  • Ultra-clear instructional tones for tutorials or audio UI

Voice samples are embedded and managed via a speaker registry, mapped to prompts via UUIDs for reproducibility.
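
As an illustration, a registry entry might look like the sketch below. The field names and helper function are hypothetical; the actual registry schema is internal.

  # Hypothetical speaker-registry sketch: each speaker embedding gets a
  # stable UUID so prompts can reference it reproducibly. Field names
  # are illustrative, not our real schema.
  import uuid

  registry = {}

  def register_speaker(name: str, embedding_path: str) -> str:
      """Store a speaker embedding under a stable UUID and return the ID."""
      speaker_id = str(uuid.uuid4())
      registry[speaker_id] = {
          "name": name,                      # e.g. "tutorial-narrator"
          "embedding_path": embedding_path,  # path to the stored embedding
      }
      return speaker_id

  sid = register_speaker("tutorial-narrator", "embeddings/narrator.npy")
  prompt = {"speaker_id": sid, "text": "Welcome to the tutorial."}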


3. Dataset Contributions

We maintain several clean, prompt-aligned speech datasets:

  • studio-english-50k: high-SNR, multi-emotion English recordings
  • mono-mix-8lang: multilingual dataset (fr, es, id, ja, ru, zh, en, hi)
  • agentic-dialog-speech: synthetic dialogues aligned with our agent framework

All datasets follow a standardized markup (prompt text, emotion tag, speaker ID, sample path) and are compatible with TTS training stacks such as ESPnet, Coqui TTS, and FastPitch.
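
For illustration, a single manifest entry in that markup might look like the following; the key names are an assumption, but the four standardized fields are the ones described above.

  # One illustrative manifest entry (key names assumed for this sketch).
  sample = {
      "text": "Could you say that again, please?",  # prompt text
      "emotion": "sarcasm",                         # emotion tag
      "speaker_id": "3f2c9a1e-...",                 # speaker ID from the registry
      "audio_path": "wavs/en/00042.wav",            # sample path
  }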


Generation Infrastructure

We use TensorOne GPU Clusters with real-time audio preprocessing and support for the following (a post-processing sketch follows the list):

  • Batch generation of .wav, .ogg, and .mp3 outputs
  • Multi-speaker pipelines in a single job
  • Post-processing filters: loudness normalization, silence trimming, reverb
  • Metadata logging (speaker ID, prompt text, timestamp, waveform hash)
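
Here is a minimal sketch of the normalization, trimming, and logging steps, assuming pydub for audio handling; the loudness target and silence threshold are illustrative defaults, and the reverb filter is omitted.

  # Post-processing + metadata-logging sketch (pip install pydub).
  # Loudness target and silence threshold are illustrative defaults.
  import hashlib, json, time
  from pydub import AudioSegment
  from pydub.silence import detect_leading_silence

  def postprocess(path: str, speaker_id: str, prompt: str, target_dbfs: float = -20.0) -> None:
      seg = AudioSegment.from_wav(path)
      seg = seg.apply_gain(target_dbfs - seg.dBFS)   # loudness normalization
      lead = detect_leading_silence(seg)             # leading silence (ms)
      trail = detect_leading_silence(seg.reverse())  # trailing silence (ms)
      seg = seg[lead:len(seg) - trail]               # silence trimming
      seg.export(path, format="wav")
      record = {                                     # metadata log entry
          "speaker_id": speaker_id,
          "prompt": prompt,
          "timestamp": time.time(),
          "waveform_hash": hashlib.sha256(open(path, "rb").read()).hexdigest(),
      }
      with open("generation_log.jsonl", "a") as f:
          f.write(json.dumps(record) + "\n")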

Inference endpoints are deployed with:

tensoronecli project deploy --imageName "tts-serverless-bark"
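
Once deployed, the endpoint can be invoked over HTTP. The URL, route, and payload below are placeholders for illustration; the real endpoint details come from the deployment output.

  # Hypothetical client call against the deployed Bark endpoint.
  # URL and payload schema are assumptions, not the actual API.
  import requests

  resp = requests.post(
      "https://api.tensorone.example/v1/tts-serverless-bark",  # placeholder URL
      json={"text": "Hello from Bark.", "speaker_id": "3f2c9a1e-..."},
      timeout=120,
  )
  with open("out.wav", "wb") as f:
      f.write(resp.content)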

Real-World Use Cases

  • Voice-enabled AI agents for research presentations
  • Synthetic podcast voices for knowledge generation
  • UI/UX narration prototypes
  • Language-learning and pronunciation tutors

Upcoming Research

  • Cross-lingual speaker cloning with voice preservation
  • Agent-driven emotional conditioning during dialogue
  • Promptable speaking styles (e.g., “like a TED talk”, “like a bedtime story”)
  • Multi-modal embeddings for text + emotion + gesture control (for TTS + animation)

For us, TTS isn’t just about realism.
It’s about control, character, and creative flexibility.
We’re building voices that don’t just sound real—they feel real.

