Vizuara
Research Preview

World Foundation Model
for Robotics

A 1.36B-parameter Diffusion Transformer trained on 147K robot video clips, learning to model robot manipulation in a continuous latent space.

1.36B
Parameters
147K
Video Clips
250K
Training Steps
4x A100
GPUs (29hrs)
16x8x8
Cosmos Tokenizer
01

Architecture

Text-conditioned video diffusion using EDM (Elucidated Diffusion Models) with a DiT backbone and NVIDIA Cosmos video tokenizer.

Text Prompt
T5-Large
1024-dim
DiT 1.36B
16L, d=2048
Cosmos CV8x8x8
Decoder
Robot Video
25f @ 256x256
Backbone
DiT, 16 layers
Hidden Dim
2048, 16 heads
3D Patchify
(1, 2, 2)
Positional
3D RoPE
Conditioning
AdaLN-LoRA r=128
Text
Cross-Attention
Training
EDM + Uncertainty
Latent Shape
[16, 4, 32, 32]
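The numbers in the spec table are mutually consistent, and the arithmetic is worth making explicit: the causal Cosmos tokenizer maps 25 frames at 256x256 to a [16, 4, 32, 32] latent, and (1, 2, 2) patchify turns that latent into the DiT's token sequence. A minimal sketch of that shape walkthrough (helper names are ours, not from the codebase):

```python
def cosmos_latent_shape(frames, height, width, channels=16, t_stride=8, s_stride=8):
    """Cosmos CV8x8x8 is causal: 1 + 8k input frames -> 1 + k latent frames,
    with 8x8 spatial downsampling into a 16-channel continuous latent."""
    t = 1 + (frames - 1) // t_stride
    return (channels, t, height // s_stride, width // s_stride)

def dit_token_count(latent_shape, patch=(1, 2, 2)):
    """3D patchify: each (pt, ph, pw) latent patch becomes one DiT token."""
    _, t, h, w = latent_shape
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)

latent = cosmos_latent_shape(25, 256, 256)  # (16, 4, 32, 32), matching the spec
tokens = dit_token_count(latent)            # 4 * 16 * 16 = 1024 tokens, each projected to d=2048
```

So the 16-layer, d=2048 DiT attends over 1,024 tokens per 25-frame clip.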
02

Training Data Reconstructions

Robot videos encoded through the Cosmos CV8x8x8 tokenizer into a 16-channel latent space, then decoded back. The tokenizer achieves 34-41 dB PSNR — near-lossless reconstruction.
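The 34-41 dB figure is a standard peak signal-to-noise ratio between each input clip and its tokenizer round-trip. A minimal sketch of the metric (assuming pixel values normalized to [0, 1]):

```python
import numpy as np

def psnr(x, x_hat, peak=1.0):
    """PSNR in dB between a video and its reconstruction.
    40 dB on [0, 1] data corresponds to an RMS error of ~0.01 per pixel."""
    mse = np.mean((np.asarray(x, dtype=np.float64) - np.asarray(x_hat, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

At 40 dB the per-pixel error is about 1/255 of the value range, which is why the reconstructions read as near-lossless.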

Bridge V2: Tabletop manipulation
Bridge V2: Object interaction
Bridge V2: Robot workspace
Bridge V2: Arm motion
OXE: Multi-robot data
OXE: Diverse environments
OXE: Manipulation task
OXE: Robot actions
03

Latent Space Interpolation

Linear interpolation between two robot video latents produces smooth, coherent transitions — demonstrating that the model has learned a structured, continuous representation of robot behavior.

Interpolation A → B: 5 steps (0%, 25%, 50%, 75%, 100%)
Interpolation C → D: smooth transition through latent space
Interpolation E → F: different robot configurations blend naturally
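The interpolation itself is a single linear blend in latent space. A minimal sketch (latents here stand in for Cosmos encodings of shape [16, 4, 32, 32]):

```python
import numpy as np

def lerp_latents(z_a, z_b, steps=5):
    """Linear interpolation between two video latents.
    steps=5 gives the 0%, 25%, 50%, 75%, 100% blends shown above;
    each blend is decoded back to video with the Cosmos decoder."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]
```

That straight lines between real encodings decode to coherent video is the evidence for a smooth, structured latent space.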
04

Learned Video Variations (SDEdit)

Starting from a real robot video, we add controlled noise and denoise with the trained DiT — generating plausible variations of the original scene. This shows the model has learned the local structure of robot manipulation.

Four example clips, each shown as the original alongside three SDEdit variations (Variation 1–3).
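The SDEdit procedure is: noise a real latent to an intermediate level, then run the denoiser from there instead of from pure noise. A hedged sketch under EDM-style noise levels — `denoise_fn(z_noisy, sigma)` is a hypothetical stand-in for the trained DiT's reverse-diffusion loop, not the actual API:

```python
import numpy as np

def sdedit_variation(z, denoise_fn, strength=0.5, sigma_max=80.0, rng=None):
    """SDEdit sketch: perturb a real video latent partway toward noise,
    then denoise from that level. Lower strength stays closer to the
    original clip; higher strength allows larger variations."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = strength * sigma_max                      # intermediate noise level
    z_noisy = z + sigma * rng.standard_normal(z.shape)
    return denoise_fn(z_noisy, sigma)                 # partial reverse diffusion
```

The `strength` knob trades faithfulness to the original clip against diversity of the generated variations.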
05

What's Next

SO-101 World Model
Dedicated model for the SO-101 robot arm using 50K+ community episodes (251 hours). Single morphology = dramatically better generation quality.
Text-to-Video Generation
Full text-conditioned video generation from noise. Type "pick up the red cup" and see the SO-101 execute it in simulation.
Action-Conditioned Planning
Use the world model to predict future states conditioned on actions — enabling model-predictive control for real robot deployment.

Learn to Build This

This World Foundation Model is one of the core topics in our Modern Robot Learning Bootcamp V2. Go from diffusion policy fundamentals to deploying world models on real SO-101 robot hardware.

Join the Bootcamp
8 live lectures
3 VLA architectures
Real SO-101 hardware
World models