Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Riccardo Mereu, Aidan Scannell, Yuxin Hou, Yi Zhao, Aditya Jitta, Antonio Dominguez, Luigi Acerbi, Amos Storkey, Paul Chang

TL;DR

This technical report describes first-place entries to both tracks of the 1X World Model Challenge for humanoid robots: Sampling (pixel-space future frame prediction) and Compression (future latent-token prediction). It presents two solutions: (i) adapting the pre-trained video foundation model Wan-2.2 TI2V-5B with video-state conditioning via AdaLN-Zero and LoRA to forecast future frames, placing first on the Sampling leaderboard with PSNR up to about 26.6 dB; (ii) training a Spatio-Temporal Transformer on tokenized latent sequences to predict future latent grids, placing first on the Compression leaderboard with a Top-500 cross-entropy of about 6.64. The report also discusses the inference trade-off between ensemble averaging and greedy decoding. Overall, the results indicate that large pre-trained video models and token-based world models are both practical for real-world humanoid world modelling, and they inform choices about inference and training efficiency.

Abstract

World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
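The AdaLN-Zero conditioning mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the tiny dimensions, and the choice to sum the timestep and robot-state embeddings are assumptions made for illustration, and the SiLU that DiT applies to the conditioning vector is omitted for brevity. The sketch shows the core idea: a zero-initialised projection of the conditioning vector produces a shift, scale, and gate, so the conditioned block starts out as the identity and leaves the pre-trained model unchanged at the start of fine-tuning.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, c):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


def adaln_zero_block(x, t_emb, s_emb, weight, bias, sublayer):
    """One adaLN-Zero modulated residual branch.

    x        : token features, length d
    t_emb    : timestep embedding, length k
    s_emb    : robot-state embedding, length k (hypothetical encoder output)
    weight   : (3*d) x k projection, zero-initialised in adaLN-Zero
    bias     : length 3*d, zero-initialised in adaLN-Zero
    sublayer : callable standing in for attention or the MLP
    """
    d = len(x)
    # Assumption: state and timestep embeddings are combined by addition.
    cond = [t + s for t, s in zip(t_emb, s_emb)]
    params = linear(weight, bias, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    # Modulate the normalised features, run the sublayer, then gate the residual.
    h = [n * (1.0 + sc) + sh for n, sc, sh in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


if __name__ == "__main__":
    d, k = 4, 3
    x = [0.5, -1.0, 2.0, 0.25]
    zero_w = [[0.0] * k for _ in range(3 * d)]
    zero_b = [0.0] * (3 * d)
    y = adaln_zero_block(x, [0.1, 0.2, 0.3], [1.0, 0.0, -1.0],
                         zero_w, zero_b, lambda h: [2.0 * v for v in h])
    print(y == x)  # the zero-initialised block acts as the identity
```

With the projection initialised to zero, the gate is zero and the output equals the input regardless of the robot state. Conditioning only takes effect as the projection is trained.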

Paper Structure

This paper contains 21 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the 1X World Model Challenges. Left depicts the context (inputs), middle the model generations, and right the evaluations. Sampling challenge (top): The model observes 17 past frames along with past and future robot states, then generates future frames in pixel space. Performance is measured by PSNR between the predicted and ground-truth 77th frame. Compression challenge (bottom): The Cosmos $8 \times 8 \times 8$ tokeniser encodes the history of 17 RGB frames into three latent token grids of shape $3 \times 32 \times 32$. Models must predict the next three latent token grids corresponding to the next 17 frames. Evaluation is based on Top-500 cross-entropy between predicted and ground-truth tokens.
  • Figure 2: State conditioning of the DiT block. The Wan2.2 TI2V-5B DiT architecture was updated to enable state conditioning using adaLN-Zero [peebles2023dit], with the state conditioning combined with the timestep of the Flow Matching scheduler [wan_paper].
  • Figure 3: Overall figure showing (a) the ST-Transformer world model architecture and (b) its training curves in the compression challenge.
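Two quantities from the Figure 1 caption can be checked with a short pure-Python sketch: the PSNR metric used in the sampling track and the $3 \times 32 \times 32$ latent grid shape from the compression track. The sketch rests on assumptions not stated in the text above: frames are taken to be $256 \times 256$ with pixel values in [0, 255], and the $8 \times 8 \times 8$ tokeniser is taken to be causal in time, so the first frame gets its own latent and every further 8 frames share one. The function names are hypothetical, and the challenge's official evaluation code may differ in its details.

```python
import math


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flattened frames.

    max_val is the assumed peak pixel value (255 for 8-bit images).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def latent_grid_shape(num_frames, height, width, t_factor=8, s_factor=8):
    """Latent grid shape for an assumed causal tokeniser.

    Assumption: the first frame maps to one latent on its own, and each
    subsequent group of t_factor frames maps to one more latent.
    """
    if (num_frames - 1) % t_factor != 0:
        raise ValueError("num_frames must be 1 + a multiple of t_factor")
    return (1 + (num_frames - 1) // t_factor, height // s_factor, width // s_factor)


if __name__ == "__main__":
    # 17 frames at an assumed 256 x 256 resolution
    print(latent_grid_shape(17, 256, 256))  # (3, 32, 32)
    # A prediction that is off by 10 grey levels at every pixel
    print(round(psnr([100.0] * 16, [110.0] * 16), 2))
```

Under these assumptions, 17 frames give 1 + 16/8 = 3 temporal latents and 256/8 = 32 positions per spatial axis, which matches the grid shape in the caption.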