Text to Video
Our Text to Video project pushes the boundaries of prompt-driven motion generation with frame-level and motion-level control, producing everything from abstract animations to cohesive cinematic scenes.
We are actively experimenting with diffusion-based video models, in addition to providing training data, testing inference chains, and building tooling for temporally consistent generation workflows.
Core Areas of Focus
1. Model Research & Evaluation
We evaluate, and where possible contribute to, the major open-source and research models:
- Google Veo 3: Used for select prompt generations; helps us assess prompt-engineering strategies and informs internal pipeline development
- ModelScope T2V: Early-stage text-to-video diffusion
- Zeroscope: ModelScope-derived video models at 576x320 and 1024x576 resolutions
- Pika & CogVideo: Prompt-to-video with multilingual support
- AnimateDiff: Motion module layered on top of image generation
- VideoCrafter: Open-source text- and image-to-video diffusion toolkit
We run comparative tests on the following (a scoring sketch follows this list):
- Motion smoothness
- Prompt-object consistency
- Loopability
- Speed vs. quality trade-offs
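To make the prompt-object consistency and smoothness checks concrete, here is a minimal scoring sketch using the public CLIP model from Hugging Face transformers. The function name and the variance-as-smoothness proxy are illustrative choices, not our exact internal harness.

```python
# Minimal sketch: score each generated frame against the prompt with CLIP and
# report mean similarity (prompt-object consistency) plus frame-to-frame
# variance as a rough temporal-stability proxy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_consistency(frames: list[Image.Image], prompt: str) -> dict:
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the prompt embedding and each frame embedding.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(-1)  # one score per frame
    return {
        "mean_similarity": sims.mean().item(),  # higher = better prompt adherence
        "frame_variance": sims.var().item(),    # lower = steadier subject across frames
    }
```

Run over a fixed prompt suite, scores like these make the speed-vs-quality trade-off comparable across models on equal footing.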
2. Multi-Stage Generation Pipelines
To improve quality and control, we chain together:
- Text → Keyframes (with Stable Diffusion or SDXL)
- Keyframes → Interpolation (using RIFE or FILM)
- Motion-aware resynthesis (via AnimateDiff or ControlNet)
- Optional: an audio-sync pass for speech-matched video
We deploy this workflow using TensorOne Clusters to parallelize each stage.
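The sketch below shows how the first two stages can be chained. The keyframe stage uses the public diffusers SDXL pipeline; the interpolation stage shells out to a placeholder script standing in for a RIFE or FILM entry point, since those tools are invoked differently per install. Paths and script names are illustrative, not our internal layout.

```python
# Sketch of the first two stages: text -> keyframes -> interpolation.
import subprocess
from pathlib import Path

import torch
from diffusers import StableDiffusionXLPipeline

def text_to_keyframes(prompt: str, n_frames: int, out_dir: Path) -> list[Path]:
    """Stage 1: generate sparse keyframes for the prompt with SDXL."""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_frames):
        image = pipe(prompt, num_inference_steps=30).images[0]
        path = out_dir / f"key_{i:03d}.png"
        image.save(path)
        paths.append(path)
    return paths

def interpolate_keyframes(keyframe_dir: Path, out_dir: Path, fps: int = 24) -> None:
    """Stage 2: hand keyframes to an interpolation tool (RIFE or FILM).

    "interpolate.py" is a placeholder; substitute the entry point of whichever
    interpolation model is installed on the worker image.
    """
    subprocess.run(
        ["python", "interpolate.py", "--input", str(keyframe_dir),
         "--output", str(out_dir), "--fps", str(fps)],
        check=True,
    )

if __name__ == "__main__":
    keys = Path("stage1_keyframes")
    text_to_keyframes("a drone flying through a cyberpunk alleyway at night", 4, keys)
    interpolate_keyframes(keys, Path("stage2_frames"))
```

Splitting the stages at file boundaries like this is what makes it straightforward to fan each one out across cluster workers.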
3. LoRA and Motion Module Training
We've contributed custom LoRA adapters for motion styles like (see the loading sketch below):
- cyberpunk-tracking: stylized tracking camera motion
- 3d-turntable-spin: full 360° slow pan
- anime-fight-loop: fast-cut, jittery action sequences
We have also trained motion-aware AnimateDiff modules using:
- Cinemagraph datasets
- TikTok-style clips with embedded captions
- Storyboard-to-animation transitions
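As a usage illustration, the sketch below applies a motion-style LoRA of this kind through the standard diffusers AnimateDiff interfaces. The base checkpoint and the local LoRA filename are placeholders, not our published artifacts.

```python
# Illustrative sketch: load an AnimateDiff motion adapter plus a motion-style
# LoRA via diffusers. Checkpoint IDs and the local LoRA path are placeholders.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism",            # any SD 1.5-compatible base checkpoint
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Hypothetical local path to a trained motion-style LoRA such as cyberpunk-tracking.
pipe.load_lora_weights("./loras/cyberpunk-tracking.safetensors")

output = pipe(
    prompt="tracking shot down a neon-lit alleyway at night, cyberpunk style",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
)
export_to_gif(output.frames[0], "cyberpunk_tracking.gif")
```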
Tooling and Infrastructure
We maintain internal tools for:
- Prompt-to-script breakdown
- Batch video rendering across cluster queues
- Scene interpolation validation
- Video embedding indexing (CLIP + motion embeddings; see the index sketch below)
Sample job command:
tensoronecli create clusters --gpuType "A100" --imageName "text2video-train" --args "bash run_pipeline.sh"
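For the embedding index, the sketch below shows one plausible layout with FAISS: a CLIP frame embedding concatenated with a motion descriptor for each clip, searched by cosine similarity. The dimensions and the source of the motion embedding are assumptions, not the internal schema.

```python
# Illustrative clip index: CLIP appearance embedding + motion descriptor,
# concatenated into one vector per clip and searched by cosine similarity.
import numpy as np
import faiss

CLIP_DIM, MOTION_DIM = 512, 128
index = faiss.IndexFlatIP(CLIP_DIM + MOTION_DIM)   # inner product on normalized vectors
clip_ids: list[str] = []

def add_clip(clip_id: str, clip_emb: np.ndarray, motion_emb: np.ndarray) -> None:
    vec = np.concatenate([clip_emb, motion_emb]).astype("float32")[None, :]
    faiss.normalize_L2(vec)                         # cosine similarity via inner product
    index.add(vec)
    clip_ids.append(clip_id)

def search(query: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
    q = query.astype("float32")[None, :]
    faiss.normalize_L2(q)
    scores, idx = index.search(q, k)
    return [(clip_ids[i], float(s)) for i, s in zip(idx[0], scores[0]) if i != -1]
```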
Experimental Outputs
Some of our early successful generations include:
- “A drone flying through a cyberpunk alleyway at night”
- “An astronaut floating in space, Earth spinning in the background”
- “Studio-lit product ad with animated particles and reflections”
- “Low-poly 3D character dancing in sync with background music”
Research Goals Ahead
- Long-form generation beyond 8s without quality collapse
- Audio-synced animation with Whisper-aligned captions (see the alignment sketch below)
- Prompt-to-storyboard-to-video chaining
- Fine-grained motion editing (speed ramping, masking)
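For the Whisper-aligned caption goal, the sketch below shows the alignment primitive we would build on: word-level timestamps from openai-whisper, flattened into (word, start, end) tuples that a caption or keyframe scheduler can consume. The audio filename is illustrative.

```python
# Word-level alignment sketch using openai-whisper's timestamp output.
import whisper

model = whisper.load_model("base")
result = model.transcribe("narration.wav", word_timestamps=True)

# Flatten segment-level output into (word, start_seconds, end_seconds) tuples.
aligned = [
    (w["word"].strip(), w["start"], w["end"])
    for segment in result["segments"]
    for w in segment["words"]
]
for word, start, end in aligned[:10]:
    print(f"{start:6.2f}-{end:6.2f}  {word}")
```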
Text-to-video is still an experimental frontier, but one with immense potential.