Matrix-Game: Interactive World Foundation Model

Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou

TL;DR

The paper introduces Matrix-Game, a 17B interactive world foundation model for controllable game world generation, trained via a two-stage pipeline on a large Minecraft-centric dataset (Matrix-Game-MC) that includes unlabeled and action-labeled video data. It grounds image-to-world generation in a 3D causal VAE latent space and uses a diffusion transformer with autoregressive, action-conditioned generation to achieve high visual fidelity, temporal coherence, and precise control. A new GameWorld Score benchmark evaluates visual quality, temporal dynamics, controllability, and physical rule understanding, with Matrix-Game achieving state-of-the-art results and strong human-rated performance. The work provides open-source model weights and a benchmark toolkit to advance future research in interactive, physically grounded world generation across diverse game environments.

Abstract

We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.

Paper Structure

This paper contains 28 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Controllable world generation results of Matrix-Game across distinct Minecraft scenarios. These demos illustrate the model's ability to handle diverse environments, ranging from desert, beach, and forest to more challenging settings like mushroom and icy biomes, while accurately responding to user control signals.
  • Figure 2: Model performance under our GameWorld Score benchmark, covering 8 key dimensions: Image Quality, Aesthetic (scaled $\times$2 for visualization), Temporal Consistency, Motion Smoothness, Keyboard Accuracy, Mouse Accuracy, Object Consistency, and Scenario Consistency. Our method outperforms Oasis [oasis2024] and MineWorld [guo2025mineworld] in all aspects, particularly in controllability (keyboard and mouse accuracy) and physical consistency, while maintaining high visual and temporal quality.
  • Figure 3: We construct our high-quality unlabeled training data from raw gameplay videos via a three-stage hierarchical filtering pipeline.
  • Figure 4: Overview of the interactive image-to-world generation paradigm. The model is trained in a spatiotemporally compressed latent space obtained through a 3D Causal VAE. Conditioned on a reference image along with Gaussian noise and action signals, it generates latent representations that are decoded into video clips. By grounding generation in the reference image, the model learns to build consistent scene representations that capture geometry, dynamics, and physical interactions, enabling the generation of temporally coherent and spatially structured videos.
  • Figure 5: (a) Autoregressive generation in Matrix-Game and (b) the architecture of Matrix-Game. To enable long-duration video generation, Matrix-Game adopts an autoregressive strategy: the last few frames of each generated clip are used as motion conditions for generating the next clip. Specifically, the latents of these motion frames are concatenated with the noisy latent along the channel dimension, and a binary mask is also concatenated to indicate which frames contain valid motion information. This design enhances local temporal consistency across video segments, allowing the model to maintain coherent dynamics over extended time horizons. Moreover, we adopt the token replacement trick from HunyuanVideo I2V [kong2024hunyuanvideo] to enable stable image-to-video generation.
  • ...and 10 more figures
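
The autoregressive conditioning described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`build_model_input`, `rollout`, `toy_denoiser`), the `(C, T, H, W)` latent layout, placing the motion frames in the first temporal positions, zero-padding the remaining positions, and using a single-channel mask are all assumptions made for illustration. The denoiser is a stand-in that simply returns the noisy channels, and action conditioning is omitted.

```python
import numpy as np

def build_model_input(noisy_latent, motion_latent):
    """Concatenate noisy latent, motion condition, and binary mask along channels.

    noisy_latent:  (C, T, H, W) latent to be denoised (assumed layout).
    motion_latent: (C, K, H, W) latents of the last K frames of the previous clip.
    Returns an array of shape (2C + 1, T, H, W).
    """
    c, t, h, w = noisy_latent.shape
    k = motion_latent.shape[1]
    # Assumption: motion latents fill the first K temporal slots, the rest is zero.
    cond = np.zeros_like(noisy_latent)
    cond[:, :k] = motion_latent
    # Binary mask: 1 where a frame carries valid motion information, else 0.
    mask = np.zeros((1, t, h, w), dtype=noisy_latent.dtype)
    mask[:, :k] = 1.0
    return np.concatenate([noisy_latent, cond, mask], axis=0)

def rollout(initial_motion, num_clips, clip_len, denoise_fn, seed=0):
    """Generate clips one after another, feeding each clip's tail into the next."""
    rng = np.random.default_rng(seed)
    c, k, h, w = initial_motion.shape
    motion, clips = initial_motion, []
    for _ in range(num_clips):
        noise = rng.standard_normal((c, clip_len, h, w))
        clip = denoise_fn(build_model_input(noise, motion))  # (C, clip_len, H, W)
        clips.append(clip)
        # The last K frames of this clip become the motion condition for the next.
        motion = clip[:, clip.shape[1] - k:]
    return np.concatenate(clips, axis=1)

def toy_denoiser(model_input):
    """Hypothetical stand-in for the diffusion transformer: returns the noisy channels."""
    c = (model_input.shape[0] - 1) // 2
    return model_input[:c]

initial_motion = np.ones((4, 2, 2, 2))  # C=4, K=2 motion frames, 2x2 spatial grid
x = build_model_input(np.zeros((4, 8, 2, 2)), initial_motion)
video = rollout(initial_motion, num_clips=3, clip_len=8, denoise_fn=toy_denoiser)
print(x.shape, video.shape)  # (9, 8, 2, 2) (4, 24, 2, 2)
```

In this sketch the mask lets a model distinguish zero-padded positions from genuine motion frames, which is the role the caption assigns to it; how the real model arranges and pads these tensors is not specified in the text above.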