Vizuara
Research Preview

World Foundation Model
for Robotics

A 1.36B-parameter Diffusion Transformer trained on 147K robot video clips, learning to model robot manipulation in a continuous latent space.

1.36B
Parameters
147K
Video Clips
250K
Training Steps
4x A100
GPUs (29hrs)
16x8x8
Cosmos Tokenizer
01

Architecture

Text-conditioned video diffusion using EDM (Elucidated Diffusion Models) with a DiT backbone and NVIDIA Cosmos video tokenizer.

Text Prompt
T5-Large
1024-dim
DiT 1.36B
16L, d=2048
Cosmos CV8x8x8
Decoder
Robot Video
25f @ 256x256
Backbone
DiT, 16 layers
Hidden Dim
2048, 16 heads
3D Patchify
(1, 2, 2)
Positional
3D RoPE
Conditioning
AdaLN-LoRA r=128
Text
Cross-Attention
Training
EDM + Uncertainty
Latent Shape
[16, 4, 32, 32]
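The numbers in the spec table are mutually consistent, and the arithmetic is worth making explicit: the causal Cosmos tokenizer maps 25 frames at 256x256 to a [16, 4, 32, 32] latent, and (1, 2, 2) patchify turns that latent into the DiT's token sequence. A minimal sketch of that shape walkthrough (helper names are ours, not from the codebase):

```python
def cosmos_latent_shape(frames, height, width, channels=16, t_stride=8, s_stride=8):
    """Cosmos CV8x8x8 is causal: 1 + 8k input frames -> 1 + k latent frames,
    with 8x8 spatial downsampling into a 16-channel continuous latent."""
    t = 1 + (frames - 1) // t_stride
    return (channels, t, height // s_stride, width // s_stride)

def dit_token_count(latent_shape, patch=(1, 2, 2)):
    """3D patchify: each (pt, ph, pw) latent patch becomes one DiT token."""
    _, t, h, w = latent_shape
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)

latent = cosmos_latent_shape(25, 256, 256)  # (16, 4, 32, 32), matching the spec
tokens = dit_token_count(latent)            # 4 * 16 * 16 = 1024 tokens, each projected to d=2048
```

So the 16-layer, d=2048 DiT attends over 1,024 tokens per 25-frame clip.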
02

Training Data Reconstructions

Robot videos encoded through the Cosmos CV8x8x8 tokenizer into a 16-channel latent space, then decoded back. The tokenizer achieves 34-41 dB PSNR — near-lossless reconstruction.
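The 34-41 dB figure is a standard peak signal-to-noise ratio between each input clip and its tokenizer round-trip. A minimal sketch of the metric (assuming pixel values normalized to [0, 1]):

```python
import numpy as np

def psnr(x, x_hat, peak=1.0):
    """PSNR in dB between a video and its reconstruction.
    40 dB on [0, 1] data corresponds to an RMS error of ~0.01 per pixel."""
    mse = np.mean((np.asarray(x, dtype=np.float64) - np.asarray(x_hat, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

At 40 dB the per-pixel error is about 1/255 of the value range, which is why the reconstructions read as near-lossless.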

Bridge V2: Tabletop manipulation
Bridge V2: Object interaction
Bridge V2: Robot workspace
Bridge V2: Arm motion
OXE: Multi-robot data
OXE: Diverse environments
OXE: Manipulation task
OXE: Robot actions
03

Latent Space Interpolation

Linear interpolation between two robot video latents produces smooth, coherent transitions — demonstrating that the model has learned a structured, continuous representation of robot behavior.

Interpolation A → B: 5 steps (0%, 25%, 50%, 75%, 100%)
Interpolation C → D: smooth transition through latent space
Interpolation E → F: different robot configurations blend naturally
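The interpolation itself is a single linear blend in latent space. A minimal sketch (latents here stand in for Cosmos encodings of shape [16, 4, 32, 32]):

```python
import numpy as np

def lerp_latents(z_a, z_b, steps=5):
    """Linear interpolation between two video latents.
    steps=5 gives the 0%, 25%, 50%, 75%, 100% blends shown above;
    each blend is decoded back to video with the Cosmos decoder."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]
```

That straight lines between real encodings decode to coherent video is the evidence for a smooth, structured latent space.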
04

Learned Video Variations (SDEdit)

Starting from a real robot video, we add controlled noise and denoise with the trained DiT — generating plausible variations of the original scene. This shows the model has learned the local structure of robot manipulation.

Four example clips, each shown as the original alongside three SDEdit variations (Variation 1–3).
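The SDEdit procedure is: noise a real latent to an intermediate level, then run the denoiser from there instead of from pure noise. A hedged sketch under EDM-style noise levels — `denoise_fn(z_noisy, sigma)` is a hypothetical stand-in for the trained DiT's reverse-diffusion loop, not the actual API:

```python
import numpy as np

def sdedit_variation(z, denoise_fn, strength=0.5, sigma_max=80.0, rng=None):
    """SDEdit sketch: perturb a real video latent partway toward noise,
    then denoise from that level. Lower strength stays closer to the
    original clip; higher strength allows larger variations."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = strength * sigma_max                      # intermediate noise level
    z_noisy = z + sigma * rng.standard_normal(z.shape)
    return denoise_fn(z_noisy, sigma)                 # partial reverse diffusion
```

The `strength` knob trades faithfulness to the original clip against diversity of the generated variations.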
05

What's Next

SO-101 World Model
Dedicated model for the SO-101 robot arm using 50K+ community episodes (251 hours). Single morphology = dramatically better generation quality.
Text-to-Video Generation
Full text-conditioned video generation from noise. Type "pick up the red cup" and see the SO-101 execute it in simulation.
Action-Conditioned Planning
Use the world model to predict future states conditioned on actions — enabling model-predictive control for real robot deployment.

Learn to Build This

This World Foundation Model is one of the core topics in our Modern Robot Learning Bootcamp V2. Go from diffusion policy fundamentals to deploying world models on real SO-101 robot hardware.

Join the Bootcamp
8 live lectures
3 VLA architectures
Real SO-101 hardware
World models