Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Riccardo Mereu; Aidan Scannell; Yuxin Hou; Yi Zhao; Aditya Jitta; Antonio Dominguez; Luigi Acerbi; Amos Storkey; Paul Chang

TL;DR

This report presents first-place solutions to both tracks of the 1X World Model Challenge for humanoid robots: Sampling (pixel-space future frame prediction) and Compression (future latent-token prediction). (i) For Sampling, the pre-trained video foundation model Wan-2.2 TI2V-5B is adapted to video-state-conditioned forecasting, with robot states injected via AdaLN-Zero and the model post-trained with LoRA; it places first on the leaderboard with 23.0 dB PSNR. (ii) For Compression, a Spatio-Temporal Transformer is trained from scratch on tokenized latent sequences to predict future latent grids; it places first with a Top-500 cross-entropy of 6.6386. The report also discusses inference trade-offs between ensemble averaging and greedy decoding. Overall, the results indicate that adapting a large pre-trained video model and training a token-based world model are both practical routes to world modelling on real-world humanoid data.

Abstract

World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
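To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch of the three generic building blocks: AdaLN-Zero conditioning, a LoRA-augmented linear layer, and the PSNR metric. It follows the standard formulations of these techniques and is not the authors' implementation. The class and function names (`AdaLNZero`, `LoRALinear`, `psnr`), the tensor shapes, and the initialization scale are all assumptions made for illustration; how and where the conditioning is attached inside Wan-2.2 TI2V-5B is not specified here.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZero:
    """Hypothetical AdaLN-Zero wrapper around one residual sub-layer.

    A conditioning vector (here: a robot state) is projected to a shift,
    a scale, and a gate. The projection is zero-initialized, so the gate
    starts at 0 and the wrapped block is the identity at initialization.
    """

    def __init__(self, state_dim, hidden_dim):
        self.W = np.zeros((state_dim, 3 * hidden_dim))  # zero init
        self.b = np.zeros(3 * hidden_dim)

    def __call__(self, tokens, state, sublayer):
        # tokens: (batch, n_tokens, hidden_dim); state: (batch, state_dim)
        shift, scale, gate = np.split(state @ self.W + self.b, 3, axis=-1)
        h = layer_norm(tokens) * (1.0 + scale[:, None, :]) + shift[:, None, :]
        return tokens + gate[:, None, :] * sublayer(h)


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, W_frozen, rank, alpha, rng):
        d_in, d_out = W_frozen.shape
        self.W = W_frozen                                 # kept frozen
        self.A = rng.normal(0.0, 0.01, size=(d_in, rank))  # trainable
        self.B = np.zeros((rank, d_out))                   # trainable, zero init
        self.scaling = alpha / rank

    def __call__(self, x):
        return x @ self.W + self.scaling * (x @ self.A) @ self.B


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because both the AdaLN-Zero projection and the LoRA matrix `B` start at zero, a pre-trained model wrapped this way initially reproduces its original outputs, and the conditioning and low-rank update are learned gradually during post-training.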
