Research Preview
World Foundation Model for Robotics
A 1.36B-parameter Diffusion Transformer trained on 147K robot video clips, learning to encode robot manipulation into a continuous latent space.
01
Architecture
Text-conditioned video diffusion using EDM (Elucidated Diffusion Models) with a DiT backbone and NVIDIA Cosmos video tokenizer.
Pipeline: Text Prompt → T5-Large (1024-dim) → DiT 1.36B (16L, d=2048) → Cosmos CV8x8x8 Decoder → Robot Video (25f @ 256x256)
Conditioning: AdaLN-LoRA, r=128
Training: EDM + Uncertainty
Latent Shape: [16, 4, 32, 32]
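The latent shape follows from the clip size and the 8x8x8 compression factors. The sketch below reproduces that arithmetic. It assumes a causal temporal layout in which the first frame gets its own latent frame and the remaining frames are compressed 8-to-1. The function name `latent_shape` is illustrative and not part of any library.

```python
def latent_shape(frames, height, width, channels=16, t_factor=8, s_factor=8):
    """Return the latent shape [C, T, H, W] for a video clip.

    Assumption: causal temporal compression, so the first frame maps to one
    latent frame and every further block of t_factor frames adds one more.
    """
    latent_frames = 1 + (frames - 1) // t_factor
    return [channels, latent_frames, height // s_factor, width // s_factor]


# 25 frames at 256x256, as stated above.
print(latent_shape(25, 256, 256))  # [16, 4, 32, 32]
```

Under these assumptions, a 25-frame 256x256 clip becomes 4 latent frames of 32x32 with 16 channels, matching the shape listed above.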
02
Training Data Reconstructions
Robot videos are encoded by the Cosmos CV8x8x8 tokenizer into a 16-channel latent space, then decoded back. The tokenizer reaches 34-41 dB PSNR, which is near-lossless reconstruction.
Bridge V2: Tabletop manipulation
Bridge V2: Object interaction
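For reference, PSNR can be computed from the mean squared error between the original and the reconstructed clip. The sketch below is a minimal version that assumes pixel values scaled to [0, 1]. The synthetic arrays stand in for real video data and are not measurements from this project.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_value ** 2 / mse)


# Synthetic stand-in for a 25-frame 256x256 RGB clip (not real data).
rng = np.random.default_rng(0)
video = rng.uniform(0.0, 1.0, size=(25, 256, 256, 3))
noisy = video + rng.normal(0.0, 0.01, size=video.shape)
print(f"PSNR: {psnr(video, noisy):.1f} dB")
```

On this scale, a per-pixel error of about 0.01 (1% of the value range) corresponds to roughly 40 dB, the upper end of the range quoted above.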
03
Latent Space Interpolation
Linear interpolation between two robot video latents produces smooth, coherent transitions, which suggests that the model has learned a structured, continuous representation of robot behavior.
Interpolation A → B: 5 steps (0%, 25%, 50%, 75%, 100%)
Interpolation C → D: smooth transition through latent space
Interpolation E → F: different robot configurations blend naturally
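The 5-step interpolation can be sketched as a convex combination of two latents at weights 0, 0.25, 0.5, 0.75, and 1. The example below uses random arrays with the latent shape [16, 4, 32, 32] as stand-ins for encoded videos. Decoding each interpolated latent back to video with the tokenizer decoder is not shown, and `lerp_latents` is an illustrative name.

```python
import numpy as np


def lerp_latents(z_a, z_b, num_steps=5):
    """Linearly interpolate between two latents of identical shape.

    Returns the interpolated latents and the weights used for z_b.
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    latents = [(1.0 - a) * z_a + a * z_b for a in alphas]
    return latents, alphas


# Random stand-ins for two encoded robot videos (not real latents).
rng = np.random.default_rng(0)
z_a = rng.standard_normal((16, 4, 32, 32))
z_b = rng.standard_normal((16, 4, 32, 32))

latents, alphas = lerp_latents(z_a, z_b)
for alpha, z in zip(alphas, latents):
    print(f"{alpha:.0%}: shape {z.shape}")
```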
04
Learned Video Variations (SDEdit)
Starting from a real robot video, we add a controlled amount of noise and then denoise with the trained DiT, generating plausible variations of the original scene. This indicates that the model has learned the local structure of robot manipulation.
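The noise-then-denoise procedure can be sketched in EDM terms: perturb the latent to a chosen noise level, then integrate the probability-flow ODE back to zero noise with a denoiser. The sampler details here (Euler steps, a Karras-style noise schedule, `sigma_min=0.002`, `rho=7`) are assumptions rather than the settings used for the results above, and `toy_denoiser` is a hypothetical stand-in for the trained DiT.

```python
import numpy as np


def karras_sigmas(sigma_max, sigma_min=0.002, num_steps=10, rho=7.0):
    """Decreasing noise levels from sigma_max to sigma_min, followed by 0."""
    if sigma_max <= sigma_min:
        raise ValueError("sigma_max must exceed sigma_min")
    ramp = np.linspace(0.0, 1.0, num_steps)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max ** inv_rho
              + ramp * (sigma_min ** inv_rho - sigma_max ** inv_rho)) ** rho
    return np.append(sigmas, 0.0)


def sdedit(z0, denoiser, sigma_start, num_steps=10, rng=None):
    """Add noise at level sigma_start, then denoise with Euler ODE steps."""
    rng = np.random.default_rng() if rng is None else rng
    sigmas = karras_sigmas(sigma_start, num_steps=num_steps)
    x = z0 + sigmas[0] * rng.standard_normal(z0.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, sigma)) / sigma  # ODE direction dx/dsigma
        x = x + (sigma_next - sigma) * d
    return x


def toy_denoiser(x, sigma, data_std=0.5):
    """Hypothetical stand-in: the exact denoiser for zero-mean Gaussian data."""
    return x * data_std ** 2 / (data_std ** 2 + sigma ** 2)


rng = np.random.default_rng(0)
z0 = 0.5 * rng.standard_normal((16, 4, 32, 32))  # stand-in for a video latent
variation = sdedit(z0, toy_denoiser, sigma_start=0.5, rng=rng)
print(variation.shape)
```

In this sketch, `sigma_start` controls how far the result may drift from the input: small values keep the variation close to the original latent, while large values approach unconditional generation.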
05
What's Next
SO-101 World Model
A dedicated model for the SO-101 robot arm using 50K+ community episodes (251 hours). Restricting training to a single morphology is expected to improve generation quality substantially.
Text-to-Video Generation
Full text-conditioned video generation from noise. Type "pick up the red cup" and see the SO-101 execute it in simulation.
Action-Conditioned Planning
Use the world model to predict future states conditioned on actions, enabling model-predictive control for real robot deployment.
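One simple form of model-predictive control is random shooting: sample candidate action sequences, roll each one out through the world model, and execute the first action of the best sequence before replanning. The sketch below illustrates the idea with a hypothetical `toy_world_model` (a 2D point that moves by a scaled action). It is not the planned action-conditioned model, and every name and parameter in it is an assumption.

```python
import numpy as np


def plan_random_shooting(state, goal, world_model, horizon=5,
                         num_candidates=256, action_dim=2, rng=None):
    """Return the first action of the lowest-cost sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0,
                             size=(num_candidates, horizon, action_dim))
    costs = np.empty(num_candidates)
    for i, sequence in enumerate(candidates):
        s = state
        for action in sequence:
            s = world_model(s, action)  # predicted next state
        costs[i] = np.linalg.norm(s - goal)  # distance to goal at the end
    best = int(np.argmin(costs))
    return candidates[best, 0], costs[best]


def toy_world_model(state, action):
    """Hypothetical dynamics: the state moves by 0.1 * action per step."""
    return state + 0.1 * action


state = np.zeros(2)
goal = np.array([0.3, -0.2])
action, cost = plan_random_shooting(state, goal, toy_world_model,
                                    rng=np.random.default_rng(0))
print(action, cost)
```

In closed-loop use, only the returned first action would be executed, after which the planner would run again from the newly observed state.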
Learn to Build This
This World Foundation Model is one of the core topics in our Modern Robot Learning Bootcamp V2. Go from diffusion policy fundamentals to deploying world models on real SO-101 robot hardware.
✓ 8 live lectures
✓ 3 VLA architectures
✓ Real SO-101 hardware
✓ World models