Solaris: Building a Multiplayer Video World Model in Minecraft

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

TL;DR

This work introduces Solaris, a multiplayer video world model that simulates consistent multi-view observations, together with a multiplayer data system designed for robust, continuous, and automated data collection in video games such as Minecraft.

Abstract

Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection in video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized capture of video and actions. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show that our architecture and training design outperform existing baselines. By open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

Paper Structure

This paper contains 38 sections, 2 equations, 14 figures, 7 tables, 2 algorithms.

Figures (14)

  • Figure 1: Selected samples from our model. Our model takes in starting frames from each player as input and generates action-conditioned videos. The action descriptions shown here are summaries of the fine-grained action sequences given to the model that span many frames. The third-person ground truth visualizations are not given to the model.
  • Figure 2: SolarisEngine Overview. (Left) Docker-based orchestration of containerized game server, camera, and controller bots. Cameras mirror Controllers' state and actions via a custom server-side plugin; Controllers are Mineflayer bots that run episode code and log low-level actions. (Right) Episodes compose reusable skill primitives from a shared library. Simplified "collector" episode code is shown.
  • Figure 3: Dataset Statistics of our training dataset. (Left) The dataset consists of four different episode categories focusing on building, combat, movement, and mining scenarios, respectively. (Middle) It has a total of 9,240 episodes and 6.32 M frames per player, for a combined 12.64 M frames. Episode types are chosen randomly with weights that decrease with respect to the typical episode length. (Right) Most episode lengths range from 128 to 512 frames or 6.4 to 25.6 seconds (we record at 20 fps).
  • Figure 4: Episode Demonstrations from our training dataset. We show the recorded frames from 3 different training episodes at various points in time. Note that the third-person "start state" and "end state" screenshots are for visualization only and are not part of the dataset.
  • Figure 5: Our modified DiT block achieves multiplayer modeling through visual interleaving along the sequence dimension. We denote the number of players with $N$ and the number of tokens per video with $M$. Multiplayer information is exchanged through a shared self-attention block. The other modules are unchanged from Matrix Game 2.0 and applied independently per player.
  • ...and 9 more figures
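
The Figure 5 caption describes per-player token sequences being interleaved along the sequence dimension so that a single shared self-attention block can exchange information between players. The toy sketch below illustrates that idea in plain Python and is not the authors' implementation. The function names, the frame-by-frame interleaving order, the `tokens_per_frame` parameter, and the single-head attention with identity projections are all assumptions made for illustration; the caption only states that $N$ players with $M$ tokens each share one self-attention block.

```python
import math
import random


def interleave(per_player, tokens_per_frame):
    """Merge N per-player token lists (length M each) into one list of length N*M.

    Assumed ordering: frame by frame, and player by player within each frame.
    """
    m = len(per_player[0])
    assert all(len(p) == m for p in per_player)
    assert m % tokens_per_frame == 0
    merged = []
    for start in range(0, m, tokens_per_frame):
        for tokens in per_player:
            merged.extend(tokens[start:start + tokens_per_frame])
    return merged


def deinterleave(merged, n_players, tokens_per_frame):
    """Inverse of interleave: split the joint sequence back into per-player lists."""
    per_player = [[] for _ in range(n_players)]
    for i in range(0, len(merged), tokens_per_frame):
        player = (i // tokens_per_frame) % n_players
        per_player[player].extend(merged[i:i + tokens_per_frame])
    return per_player


def self_attention(tokens):
    """Toy single-head softmax attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total
                    for d in range(dim)])
    return out


def shared_attention_block(per_player, tokens_per_frame):
    """Attend over the joint N*M sequence, then return per-player outputs."""
    merged = interleave(per_player, tokens_per_frame)
    attended = self_attention(merged)
    return deinterleave(attended, len(per_player), tokens_per_frame)


random.seed(0)
N, M, D, TOKENS_PER_FRAME = 2, 4, 3, 2  # players, tokens per video, channels
players = [[[random.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
           for _ in range(N)]
out = shared_attention_block(players, TOKENS_PER_FRAME)
print(len(out), len(out[0]), len(out[0][0]))  # prints: 2 4 3
```

Because attention runs over the joint sequence, each player's output tokens depend on the other player's tokens, while any module applied after `deinterleave` can still operate on each player independently, as the caption describes for the unchanged modules.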