iVideoGPT: Interactive VideoGPTs are Scalable World Models

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long

TL;DR

iVideoGPT introduces a scalable, interactive world-model framework that unifies visual observations, actions, and rewards in an autoregressive transformer. It centers on compressive tokenization, which substantially shortens the video token sequence while preserving dynamics, enabling efficient pre-training on millions of manipulation trajectories and flexible fine-tuning for downstream tasks. In experiments on video prediction, visual planning, and visual model-based reinforcement learning (MBRL), the approach performs competitively with state-of-the-art methods and adapts data-efficiently, including zero- and few-shot transfer with tokenizer adaptation. The work advances interactive, scalable world models and shows promise for robotic manipulation and embodied AI, while acknowledging limitations in data diversity and reward design in certain benchmarks.

Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
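
The step-level interaction described in the abstract (an agent supplies an action, the model predicts the next observation tokens and a reward, and the loop repeats) can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the class `ToyInteractiveWorldModel`, the constants `VOCAB_SIZE`, `TOKENS_PER_FRAME`, and `SLOT_TOKEN`, and the `rollout` helper are all hypothetical names, and random sampling stands in for a trained transformer.

```python
import random
from typing import Callable, List, Optional, Tuple

VOCAB_SIZE = 32          # hypothetical size of the visual codebook
TOKENS_PER_FRAME = 4     # hypothetical number of tokens per predicted frame
SLOT_TOKEN = VOCAB_SIZE  # hypothetical separator id marking a frame boundary


class ToyInteractiveWorldModel:
    """Toy stand-in for an autoregressive world model (random, untrained)."""

    def __init__(self, context_tokens: List[int], seed: int = 0) -> None:
        self.sequence: List[int] = list(context_tokens)  # running token history
        self.actions: List[float] = []                   # actions received so far
        self._rng = random.Random(seed)

    def step(self, action: float) -> Tuple[List[int], float]:
        """Consume one action, then emit the next frame's tokens and a reward."""
        # A separator token opens the new step. A trained model could condition
        # on the action at this position; here the action is only recorded.
        self.sequence.append(SLOT_TOKEN)
        self.actions.append(action)

        frame: List[int] = []
        for _ in range(TOKENS_PER_FRAME):
            # Placeholder for sampling from p(next token | all previous tokens).
            token = self._rng.randrange(VOCAB_SIZE)
            frame.append(token)
            self.sequence.append(token)

        # Placeholder reward in [0, 1]; a trained model would predict this.
        reward = sum(frame) / (TOKENS_PER_FRAME * (VOCAB_SIZE - 1))
        return frame, reward


def rollout(
    model: ToyInteractiveWorldModel,
    policy: Callable[[Optional[List[int]]], float],
    horizon: int,
) -> List[float]:
    """Alternate between the policy and the model for `horizon` steps."""
    rewards: List[float] = []
    frame: Optional[List[int]] = None  # no predicted frame before the first step
    for _ in range(horizon):
        action = policy(frame)              # the agent reacts to the latest frame
        frame, reward = model.step(action)  # the model imagines the next step
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    model = ToyInteractiveWorldModel(context_tokens=[1, 2, 3, 4, 5, 6, 7, 8])
    print(rollout(model, policy=lambda frame: 0.5, horizon=3))
```

The point of the sketch is the control flow: because each step is a separate run of tokens, the agent can choose every action after seeing the previous prediction, rather than committing to a whole action sequence up front.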

Paper Structure (66 sections, 7 equations, 23 figures, 8 tables, 1 algorithm)

Figures (23)

  • Figure 1: Practical applications of iVideoGPT, which is designed to scale, allowing pre-training on millions of human and robotic manipulation trajectories. This results in a single, versatile foundation of interactive world models, adaptable to a wide range of downstream tasks.
  • Figure 2: Conceptual comparison among architectures, illustrated using a single context frame ($T_0 = 1$) for simplicity. (a) Recurrent architectures for world models like Dreamer [hafner2019dream] and MuZero [schrittwieser2020mastering] provide step-level interactivity but limited scalability. (b) Recent video generation advancements like VideoGPT [yan2021videogpt] and Stable Video Diffusion [blattmann2023align, blattmann2023stable] use non-causal temporal modules that can only offer trajectory-level interactivity. (c) Our model utilizes an autoregressive transformer that separately maps each step into a sequence of tokens, achieving both scalability and interactivity.
  • Figure 3: Architecture of iVideoGPT, simplified to show only a single context frame ($T_0 = 1$). (a) Compressive tokenization utilizes a conditional VQGAN that discretizes future frames conditioned on context frames to handle temporal redundancy, significantly reducing the number of video tokens. (b) An autoregressive transformer integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, enabling interactive agent experiences through next-token prediction. Actions and rewards are optional and not included in action-free video pre-training.
  • Figure 4: Qualitative evaluation: video prediction results of iVideoGPT on Open X-Embodiment, RoboNet, and VP$^2$. Zoom in for details. Extended examples can be found in the appendix.
  • Figure 5: Visual MPC results on the VP$^2$ benchmark. We report the mean and min/max performance of iVideoGPT over 3 control runs. On the right, we show mean scores normalized by the performance of the simulator and averaged across all tasks except flat block, which is excluded due to low simulator performance.
  • ...and 18 more figures
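
The token savings described in the Figure 3 caption can be illustrated with simple arithmetic. The sketch below compares tokenizing every frame independently against a compressive scheme in which only context frames receive a full token grid and each future frame receives a much smaller one. The function names and the concrete numbers (16 frames, 2 context frames, 256 tokens per context frame, 16 tokens per future frame) are assumptions chosen for illustration, not values taken from the text above, and separator tokens are ignored.

```python
def independent_length(num_frames: int, tokens_per_frame: int) -> int:
    """Sequence length when every frame gets its own full token grid."""
    return num_frames * tokens_per_frame


def compressive_length(
    num_frames: int,
    context_frames: int,
    tokens_context: int,
    tokens_future: int,
) -> int:
    """Sequence length when future frames are encoded conditioned on context.

    Context frames keep a full grid of `tokens_context` tokens each; every
    remaining frame is encoded with only `tokens_future` tokens.
    """
    if not 0 < context_frames <= num_frames:
        raise ValueError("context_frames must be in the range 1..num_frames")
    future_frames = num_frames - context_frames
    return context_frames * tokens_context + future_frames * tokens_future


if __name__ == "__main__":
    # Illustrative (assumed) settings, not figures quoted from the paper page.
    full = independent_length(num_frames=16, tokens_per_frame=256)
    short = compressive_length(
        num_frames=16, context_frames=2, tokens_context=256, tokens_future=16
    )
    print(f"independent: {full} tokens, compressive: {short} tokens")
    print(f"reduction factor: {full / short:.2f}x")
```

Under these assumed settings the sequence shrinks from 4096 to 736 tokens. Since self-attention cost grows with sequence length, a reduction of this kind is what makes longer rollouts and larger-scale pre-training more affordable.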