Text to Video
Our Text to Video project pushes the boundaries of prompt-driven motion generation with frame-level and motion-level control, producing everything from abstract animations to cohesive cinematic scenes.
We are actively experimenting with diffusion-based video models, in addition to providing training data, testing inference chains, and building tooling for temporally consistent generation workflows.
Core Areas of Focus
1. Model Research & Evaluation
We evaluate, and where possible contribute to, the major open-source and research models:
- Google Veo 3: Used for select prompt generations; helps us assess prompt-engineering strategies and informs internal pipeline development
- ModelScope T2V: Early-stage text-to-video diffusion
- Zeroscope: ModelScope-derived video models at 576x320 and 1024x576 resolutions
- Pika & CogVideo: Prompt-to-video with multilingual support
- AnimateDiff: Motion module layered on top of image generation
- VideoCrafter: Open-source text- and image-to-video diffusion toolkit
We run comparative tests on the following (a scoring sketch follows this list):
- Motion smoothness
- Prompt-object consistency
- Loopability
- Speed vs. quality trade-offs
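To make the prompt-object consistency and smoothness checks concrete, here is a minimal scoring sketch using the public CLIP model from Hugging Face transformers. The function name and the variance-as-smoothness proxy are illustrative choices, not our exact internal harness.

```python
# Minimal sketch: score each generated frame against the prompt with CLIP and
# report mean similarity (prompt-object consistency) plus frame-to-frame
# variance as a rough temporal-stability proxy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_consistency(frames: list[Image.Image], prompt: str) -> dict:
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the prompt embedding and each frame embedding.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(-1)  # one score per frame
    return {
        "mean_similarity": sims.mean().item(),  # higher = better prompt adherence
        "frame_variance": sims.var().item(),    # lower = steadier subject across frames
    }
```

Run over a fixed prompt suite, scores like these make the speed-vs-quality trade-off comparable across models on equal footing.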
2. Multi-Stage Generation Pipelines
To improve quality and control, we chain together:
- Text → Keyframes (with Stable Diffusion or SDXL)
- Keyframes → Interpolation (using RIFE or FILM)
- Motion-aware resynthesis (via AnimateDiff or ControlNet)
- Optional: an audio-sync pass for speech-matched video
We deploy this workflow using TensorOne Clusters to parallelize each stage.
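The sketch below shows how the first two stages can be chained. The keyframe stage uses the public diffusers SDXL pipeline; the interpolation stage shells out to a placeholder script standing in for a RIFE or FILM entry point, since those tools are invoked differently per install. Paths and script names are illustrative, not our internal layout.

```python
# Sketch of the first two stages: text -> keyframes -> interpolation.
import subprocess
from pathlib import Path

import torch
from diffusers import StableDiffusionXLPipeline

def text_to_keyframes(prompt: str, n_frames: int, out_dir: Path) -> list[Path]:
    """Stage 1: generate sparse keyframes for the prompt with SDXL."""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_frames):
        image = pipe(prompt, num_inference_steps=30).images[0]
        path = out_dir / f"key_{i:03d}.png"
        image.save(path)
        paths.append(path)
    return paths

def interpolate_keyframes(keyframe_dir: Path, out_dir: Path, fps: int = 24) -> None:
    """Stage 2: hand keyframes to an interpolation tool (RIFE or FILM).

    "interpolate.py" is a placeholder; substitute the entry point of whichever
    interpolation model is installed on the worker image.
    """
    subprocess.run(
        ["python", "interpolate.py", "--input", str(keyframe_dir),
         "--output", str(out_dir), "--fps", str(fps)],
        check=True,
    )

if __name__ == "__main__":
    keys = Path("stage1_keyframes")
    text_to_keyframes("a drone flying through a cyberpunk alleyway at night", 4, keys)
    interpolate_keyframes(keys, Path("stage2_frames"))
```

Splitting the stages at file boundaries like this is what makes it straightforward to fan each one out across cluster workers.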
3. LoRA and Motion Module Training
We've contributed custom LoRA adapters for motion styles like (see the loading sketch below):
- cyberpunk-tracking: stylized tracking camera motion
- 3d-turntable-spin: full 360° slow pan
- anime-fight-loop: fast-cut, jittery action sequences
We have also trained motion-aware AnimateDiff modules using:
- Cinemagraph datasets
- TikTok-style clips with embedded captions
- Storyboard-to-animation transitions
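As a usage illustration, the sketch below applies a motion-style LoRA of this kind through the standard diffusers AnimateDiff interfaces. The base checkpoint and the local LoRA filename are placeholders, not our published artifacts.

```python
# Illustrative sketch: load an AnimateDiff motion adapter plus a motion-style
# LoRA via diffusers. Checkpoint IDs and the local LoRA path are placeholders.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism",            # any SD 1.5-compatible base checkpoint
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Hypothetical local path to a trained motion-style LoRA such as cyberpunk-tracking.
pipe.load_lora_weights("./loras/cyberpunk-tracking.safetensors")

output = pipe(
    prompt="tracking shot down a neon-lit alleyway at night, cyberpunk style",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
)
export_to_gif(output.frames[0], "cyberpunk_tracking.gif")
```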
Tooling and Infrastructure
We maintain internal tools for:
- Prompt-to-script breakdown
- Batch video rendering across cluster queues
- Scene interpolation validation
- Video embedding indexing (CLIP + motion embeddings; see the index sketch below)
Sample job command:
tensoronecli create clusters --gpuType "A100" --imageName "text2video-train" --args "bash run_pipeline.sh"
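For the embedding index, the sketch below shows one plausible layout with FAISS: a CLIP frame embedding concatenated with a motion descriptor for each clip, searched by cosine similarity. The dimensions and the source of the motion embedding are assumptions, not the internal schema.

```python
# Illustrative clip index: CLIP appearance embedding + motion descriptor,
# concatenated into one vector per clip and searched by cosine similarity.
import numpy as np
import faiss

CLIP_DIM, MOTION_DIM = 512, 128
index = faiss.IndexFlatIP(CLIP_DIM + MOTION_DIM)   # inner product on normalized vectors
clip_ids: list[str] = []

def add_clip(clip_id: str, clip_emb: np.ndarray, motion_emb: np.ndarray) -> None:
    vec = np.concatenate([clip_emb, motion_emb]).astype("float32")[None, :]
    faiss.normalize_L2(vec)                         # cosine similarity via inner product
    index.add(vec)
    clip_ids.append(clip_id)

def search(query: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
    q = query.astype("float32")[None, :]
    faiss.normalize_L2(q)
    scores, idx = index.search(q, k)
    return [(clip_ids[i], float(s)) for i, s in zip(idx[0], scores[0]) if i != -1]
```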
Experimental Outputs
Some of our early successful generations include:
- “A drone flying through a cyberpunk alleyway at night”
- “An astronaut floating in space, Earth spinning in the background”
- “Studio-lit product ad with animated particles and reflections”
- “Low-poly 3D character dancing in sync with background music”
Research Goals Ahead
- Long-form generation beyond 8s without quality collapse
- Audio-synced animation with Whisper-aligned captions (see the alignment sketch below)
- Prompt-to-storyboard-to-video chaining
- Fine-grained motion editing (speed ramping, masking)
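For the Whisper-aligned caption goal, the sketch below shows the alignment primitive we would build on: word-level timestamps from openai-whisper, flattened into (word, start, end) tuples that a caption or keyframe scheduler can consume. The audio filename is illustrative.

```python
# Word-level alignment sketch using openai-whisper's timestamp output.
import whisper

model = whisper.load_model("base")
result = model.transcribe("narration.wav", word_timestamps=True)

# Flatten segment-level output into (word, start_seconds, end_seconds) tuples.
aligned = [
    (w["word"].strip(), w["start"], w["end"])
    for segment in result["segments"]
    for w in segment["words"]
]
for word, start, end in aligned[:10]:
    print(f"{start:6.2f}-{end:6.2f}  {word}")
```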
Text-to-video is still an experimental frontier, but one with immense potential.