
Experimental Projects

Text to Speech

Our Text to Speech (TTS) initiative focuses on making AI-generated voices more expressive, controllable, and human-like—without sacrificing speed or clarity.

We’re not just fine-tuning existing models. We’re curating datasets, training custom speakers, and experimenting with multi-style, multilingual, and emotionally aware voice synthesis.


Areas of Contribution

1. Custom TTS Models

We experiment with and contribute back to the following architectures (a minimal inference sketch follows the list):

  • VITS / YourTTS: Non-autoregressive synthesis with emotion conditioning
  • Bark: Multilingual, multi-modal speech synthesis with support for music tones and nonverbal sounds
  • StyleTTS2: Style control and speaker embedding mixing
  • Tortoise: Slow but ultra-realistic voice generation for narration
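
For a concrete starting point, here is a minimal inference sketch using the Coqui TTS library (one of the training stacks listed under dataset compatibility below) with a public VITS checkpoint from the Coqui model zoo; the checkpoint is an example, not one of our fine-tunes.

  # Minimal VITS inference via Coqui TTS (pip install TTS).
  # The checkpoint is a public Coqui zoo model used for illustration;
  # swap in a fine-tuned checkpoint for one of our custom speakers.
  from TTS.api import TTS

  tts = TTS(model_name="tts_models/en/ljspeech/vits")
  tts.tts_to_file(
      text="Non-autoregressive synthesis keeps latency low.",
      file_path="demo.wav",
  )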

Our fine-tunes include:

  • Emotion-tagged speaker datasets (e.g., sarcasm, whisper, shout)
  • Instructional tone datasets for professional narration
  • Open-domain conversational samples with prosody variation

2. Voice Cloning + LoRA Adapters

We’ve developed LoRA adapters and speaker embeddings for:

  • Internal team voices (used in agentic pipelines)
  • Fictional character voices for agent role-play
  • Ultra-clear instructional tones for tutorials or audio UI

Voice samples are embedded and managed via a speaker registry, mapped to prompts via UUIDs for reproducibility.
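
As an illustration, a registry entry might look like the sketch below. The field names and helper function are hypothetical; the actual registry schema is internal.

  # Hypothetical speaker-registry sketch: each speaker embedding gets a
  # stable UUID so prompts can reference it reproducibly. Field names
  # are illustrative, not our real schema.
  import uuid

  registry = {}

  def register_speaker(name: str, embedding_path: str) -> str:
      """Store a speaker embedding under a stable UUID and return the ID."""
      speaker_id = str(uuid.uuid4())
      registry[speaker_id] = {
          "name": name,                      # e.g. "tutorial-narrator"
          "embedding_path": embedding_path,  # path to the stored embedding
      }
      return speaker_id

  sid = register_speaker("tutorial-narrator", "embeddings/narrator.npy")
  prompt = {"speaker_id": sid, "text": "Welcome to the tutorial."}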


3. Dataset Contributions

We maintain several clean, prompt-aligned speech datasets:

  • studio-english-50k: high-SNR, multi-emotion English recordings
  • mono-mix-8lang: multilingual dataset (fr, es, id, ja, ru, zh, en, hi)
  • agentic-dialog-speech: synthetic dialogues aligned with our agent framework

All datasets follow a standardized markup (prompt text, emotion tag, speaker ID, sample path) and are compatible with TTS training stacks such as ESPnet, Coqui TTS, and FastPitch.
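
For illustration, a single manifest entry in that markup might look like the following; the key names are an assumption, but the four standardized fields are the ones described above.

  # One illustrative manifest entry (key names assumed for this sketch).
  sample = {
      "text": "Could you say that again, please?",  # prompt text
      "emotion": "sarcasm",                         # emotion tag
      "speaker_id": "3f2c9a1e-...",                 # speaker ID from the registry
      "audio_path": "wavs/en/00042.wav",            # sample path
  }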


Generation Infrastructure

We use TensorOne GPU Clusters with real-time audio preprocessing and support for the following (a post-processing sketch follows the list):

  • Batch generation of .wav, .ogg, and .mp3 outputs
  • Multi-speaker pipelines in a single job
  • Post-processing filters: loudness normalization, silence trimming, reverb
  • Metadata logging (speaker ID, prompt text, timestamp, waveform hash)
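
Here is a minimal sketch of the normalization, trimming, and logging steps, assuming pydub for audio handling; the loudness target and silence threshold are illustrative defaults, and the reverb filter is omitted.

  # Post-processing + metadata-logging sketch (pip install pydub).
  # Loudness target and silence threshold are illustrative defaults.
  import hashlib, json, time
  from pydub import AudioSegment
  from pydub.silence import detect_leading_silence

  def postprocess(path: str, speaker_id: str, prompt: str, target_dbfs: float = -20.0) -> None:
      seg = AudioSegment.from_wav(path)
      seg = seg.apply_gain(target_dbfs - seg.dBFS)   # loudness normalization
      lead = detect_leading_silence(seg)             # leading silence (ms)
      trail = detect_leading_silence(seg.reverse())  # trailing silence (ms)
      seg = seg[lead:len(seg) - trail]               # silence trimming
      seg.export(path, format="wav")
      record = {                                     # metadata log entry
          "speaker_id": speaker_id,
          "prompt": prompt,
          "timestamp": time.time(),
          "waveform_hash": hashlib.sha256(open(path, "rb").read()).hexdigest(),
      }
      with open("generation_log.jsonl", "a") as f:
          f.write(json.dumps(record) + "\n")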

Inference endpoints are deployed with:

tensoronecli project deploy --imageName "tts-serverless-bark"
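
Once deployed, the endpoint can be invoked over HTTP. The URL, route, and payload below are placeholders for illustration; the real endpoint details come from the deployment output.

  # Hypothetical client call against the deployed Bark endpoint.
  # URL and payload schema are assumptions, not the actual API.
  import requests

  resp = requests.post(
      "https://api.tensorone.example/v1/tts-serverless-bark",  # placeholder URL
      json={"text": "Hello from Bark.", "speaker_id": "3f2c9a1e-..."},
      timeout=120,
  )
  with open("out.wav", "wb") as f:
      f.write(resp.content)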

Real-World Use Cases

  • Voice-enabled AI agents for research presentations
  • Synthetic podcast voices for knowledge generation
  • UI/UX narration prototypes
  • Language-learning and pronunciation tutors

Upcoming Research

  • Cross-lingual speaker cloning with voice preservation
  • Agent-driven emotional conditioning during dialogue
  • Promptable speaking styles (e.g., “like a TED talk”, “like a bedtime story”)
  • Multi-modal embeddings for text + emotion + gesture control (for TTS + animation)

For us, TTS isn’t just about realism.
It’s about control, character, and creative flexibility.
We’re building voices that don’t just sound real—they feel real.

