
AI Agents & Tools

Text to Video

Our Text to Video project pushes the boundaries of prompt-driven motion generation, using frame-level and motion-level control to produce everything from abstract animations to cohesive cinematic scenes.

We are actively experimenting with diffusion-based video models, alongside providing training data, testing inference chains, and developing tooling for temporally consistent generation workflows.


Core Areas of Focus

1. Model Research & Evaluation

We evaluate and contribute across major open-source and research models (a minimal usage sketch follows the list):

  • Google Veo 3: Used for select prompt generations; helps assess prompt-engineering strategies and informs internal pipeline development
  • ModelScope T2V: Early-stage text-to-video diffusion
  • Zeroscope: SD-based 576x320 and 1024x576 resolution video models
  • Pika & CogVideo: Prompt-to-video with multilingual support
  • AnimateDiff: Motion module layered on top of image generation
  • VideoCrafter: Latent video diffusion with temporal attention layers
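
For orientation, here is a minimal sketch of driving one of the open models above (ModelScope T2V) through the Hugging Face diffusers pipeline. The checkpoint ID, frame count, and step count are illustrative defaults rather than our production settings:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the ModelScope text-to-video checkpoint in half precision (assumes a CUDA GPU).
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

# Generate a short clip; num_frames and num_inference_steps trade speed for quality.
result = pipe(
    "A drone flying through a cyberpunk alleyway at night",
    num_frames=16,
    num_inference_steps=25,
)
frames = result.frames[0]  # recent diffusers versions return a batch of videos; take the first
export_to_video(frames, "drone_alley.mp4")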

We run comparative tests along the following axes; a minimal scoring sketch follows the list:

  • Motion smoothness
  • Prompt-object consistency
  • Loopability
  • Speed vs. quality trade-offs
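
As a rough illustration of how two of these axes can be scored at the pixel level, the sketch below assumes decoded frames as NumPy arrays. Our internal harness layers CLIP and optical-flow features on top of this; the function names and normalisation here are illustrative:

import numpy as np

def frame_mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two frames of shape (H, W, C) in [0, 255]."""
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

def motion_smoothness(frames: list[np.ndarray]) -> float:
    """Lower is smoother: average change between consecutive frames."""
    diffs = [frame_mse(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    return float(np.mean(diffs))

def loopability(frames: list[np.ndarray]) -> float:
    """Distance between the last and first frame, normalised by the typical
    frame-to-frame change; values near 1.0 mean the loop seam is no more
    jarring than an ordinary transition."""
    seam = frame_mse(frames[-1], frames[0])
    typical = motion_smoothness(frames) + 1e-8
    return seam / typical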

2. Multi-Stage Generation Pipelines

To improve quality and control, we chain together:

  • Text → Keyframes (with Stable Diffusion or SDXL)
  • Keyframes → Interpolation (using RIFE or FILM)
  • Motion-aware resynthesis (via AnimateDiff or ControlNet)
  • Optional: audio syncing for speech-matched video

We deploy this workflow using TensorOne Clusters to parallelize each stage.
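
A minimal orchestration sketch of that chain is shown below. The stage script names and flags are placeholders for the real SDXL, RIFE/FILM, and AnimateDiff entry points; the point is the hand-off of each stage's output directory to the next:

import subprocess
from pathlib import Path

# Hypothetical stage scripts; the real pipeline swaps in the SDXL, RIFE/FILM,
# and AnimateDiff entry points. Each stage reads the previous stage's output.
STAGES = [
    ("keyframes",     ["python", "generate_keyframes.py", "--prompt-file", "prompt.txt"]),
    ("interpolation", ["python", "interpolate_frames.py", "--model", "rife"]),
    ("resynthesis",   ["python", "resynthesize.py", "--motion-module", "animatediff"]),
]

def run_pipeline(workdir: str) -> None:
    """Run each stage in order, passing --in/--out directories between them."""
    prev_out = None
    for name, cmd in STAGES:
        out_dir = Path(workdir) / name
        out_dir.mkdir(parents=True, exist_ok=True)
        full_cmd = cmd + ["--out", str(out_dir)]
        if prev_out is not None:
            full_cmd += ["--in", str(prev_out)]
        subprocess.run(full_cmd, check=True)  # fail fast if a stage errors
        prev_out = out_dir

if __name__ == "__main__":
    run_pipeline("outputs/run_001")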

3. LoRA and Motion Module Training

We've contributed custom LoRA adapters for motion styles like:

  • cyberpunk-tracking: stylized tracking camera motion
  • 3d-turntable-spin: full 360° slow pan
  • anime-fight-loop: fast-cut, jittery action sequences

We have also trained motion-aware AnimateDiff modules (a loading sketch follows the list) using:

  • Cinemagraph datasets
  • TikTok-style clips with embedded captions
  • Storyboard-to-animation transitions
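
For reference, the sketch below loads a motion adapter and a motion-style LoRA through the Hugging Face diffusers AnimateDiff integration. The public guoyww adapter and the local LoRA path stand in for our internal modules, and the base checkpoint and prompt are illustrative:

import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Public AnimateDiff motion adapter as a stand-in for our internal motion modules.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
# Any Stable Diffusion 1.5-family checkpoint works as the image base.
pipe = AnimateDiffPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

# Placeholder path: swap in one of the motion-style LoRAs listed above.
pipe.load_lora_weights("loras/cyberpunk-tracking", adapter_name="cyberpunk-tracking")
pipe.to("cuda")

frames = pipe(
    prompt="night-time drone shot, neon-lit alleyway, tracking camera",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
).frames[0]
export_to_gif(frames, "cyberpunk_tracking.gif")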

Tooling and Infrastructure

We maintain internal tools for:

  • Prompt-to-script breakdown
  • Batch video rendering across cluster queues
  • Scene interpolation validation
  • Video embedding indexing (CLIP + motion embeddings; a minimal sketch follows)
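
The indexing tool boils down to frame-level CLIP features pooled into a single vector per clip; a minimal sketch of that half is below. The motion-embedding half (e.g. optical-flow features) is concatenated separately and not shown, and the model ID, stride, and pooling choice are illustrative:

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(frames: list[Image.Image], stride: int = 4) -> np.ndarray:
    """Mean-pooled CLIP image embedding over a subsample of frames."""
    sampled = frames[::stride]
    inputs = processor(images=sampled, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)    # (N, 512) for this checkpoint
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise per frame
    return feats.mean(dim=0).numpy()                  # single clip-level vector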

Sample job command:

tensoronecli create clusters --gpuType "A100" --imageName "text2video-train" --args "bash run_pipeline.sh"
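
To queue several prompt batches, the same command can be wrapped in a short script. The batch names and the extra argument passed to run_pipeline.sh are illustrative:

import subprocess

PROMPT_BATCHES = ["batch_a", "batch_b", "batch_c"]  # hypothetical batch names

for batch in PROMPT_BATCHES:
    # Reuses the documented command; only the argument to run_pipeline.sh changes per batch.
    subprocess.run(
        [
            "tensoronecli", "create", "clusters",
            "--gpuType", "A100",
            "--imageName", "text2video-train",
            "--args", f"bash run_pipeline.sh {batch}",
        ],
        check=True,
    )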

Experimental Outputs

Some of our early successful generations include:

  • “A drone flying through a cyberpunk alleyway at night”
  • “An astronaut floating in space, Earth spinning in the background”
  • “Studio-lit product ad with animated particles and reflections”
  • “Low-poly 3D character dancing in sync with background music”

Research Goals Ahead

  • Long-form generation beyond 8s without quality collapse
  • Audio-synced animation with Whisper-aligned captions
  • Prompt-to-storyboard-to-video chaining
  • Fine-grained motion editing (speed ramping, masking)

Text-to-video is still an experimental frontier, but one with immense potential.
