MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines

Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel Ruiz

TL;DR

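MultiGen adds an explicit external memory to a diffusion game engine and decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct, editable control over environment structure via the memory representation, and it extends naturally to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.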

Abstract

Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference where players hold influence over a common world. To address these limitations, we introduce an explicit external memory into the system: a persistent state that operates independently of the model's context window, is continually updated by user actions, and is queried throughout the generation rollout. Unlike conventional diffusion game engines that operate as next-frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct, editable control over environment structure via an editable memory representation, and it naturally extends to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.
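The Memory / Observation / Dynamics split described in the abstract can be illustrated with a toy rollout loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the grid encoding, the `readout` query, the action names, the order of the dynamics and observation steps, and every class and function name below are hypothetical, and `observe` returns a placeholder dictionary where the real system would run a diffusion model.

```python
from dataclasses import dataclass

# Unit steps for headings 0..3 (east, south, west, north on a row-major grid).
# The discrete headings are an assumption made for this toy example.
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


@dataclass
class Memory:
    """External state kept outside the model context: map geometry plus pose."""
    grid: list   # rows of 0 (free) / 1 (wall); hypothetical encoding
    pose: tuple  # (x, y, heading index into DIRS)

    def is_free(self, x, y):
        in_bounds = 0 <= y < len(self.grid) and 0 <= x < len(self.grid[0])
        return in_bounds and self.grid[y][x] == 0

    def readout(self, max_range=8):
        """Toy memory query: number of cells to the first wall along the heading."""
        x, y, h = self.pose
        dx, dy = DIRS[h]
        for dist in range(1, max_range + 1):
            if not self.is_free(x + dist * dx, y + dist * dy):
                return dist
        return max_range


def dynamics(memory, action):
    """Stand-in for the Dynamics module: update the pose from the action."""
    x, y, h = memory.pose
    dx, dy = DIRS[h]
    if action == "left":
        h = (h - 1) % 4
    elif action == "right":
        h = (h + 1) % 4
    elif action == "forward" and memory.is_free(x + dx, y + dy):
        x, y = x + dx, y + dy
    memory.pose = (x, y, h)


def observe(history, action, readout):
    """Stand-in for the Observation module (a diffusion model in the paper)."""
    return {"action": action, "wall_distance": readout,
            "context_frames": len(history)}


def rollout(memory, actions):
    """Advance the pose, query memory, and generate one frame per action."""
    history = []
    for action in actions:
        dynamics(memory, action)
        history.append(observe(history, action, memory.readout()))
    return history
```

Because the map lives in `Memory` rather than in the frame history, editing `memory.grid` between calls to `rollout` changes what later frames are conditioned on, which is the kind of editable control the abstract describes.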

Paper Structure (31 sections, 11 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Level Design via Editable Memory. Users define a level through coarse 2D geometry (left). During inference, the diffusion model generates first-person observations consistent with the top-down level layout (right).
  • Figure 2: Method overview. We introduce an explicit external memory and factor the diffusion game engine into three modules: Memory (map geometry and pose; \ref{sec:memory}), Observation (next-frame generation conditioned on history and memory readouts; \ref{sec:obs}), and Dynamics (pose update for state progression; \ref{sec:dynamics}).
  • Figure 3: Example rollouts under an authored map and action sequence. Top: minimap $M$ with pose $p_t$ (red arrow). Middle: generated first-person observations $\hat{o}_t$. Bottom: actions $a_t$. The viewpoint evolves coherently with the action inputs while adhering to the designed layout.
  • Figure 4: Example Two-Player Gameplay Rollout. Our method generates consistent first-person views for both players by maintaining a shared world memory. The rollout shows a short two-player interaction: the players meet, and Player 1 kills Player 2, after which Player 2 is removed from the shared state. Player 1 then explores the map while Player 2 respawns and is re-added to the shared state. The players meet again, and Player 1 kills Player 2 once more. Note that both views are consistent with each other, as actions from one player directly affect the next-frame observation generated by the other model. All gameplay frames are generated using the observation module. Frames shown during player death are not part of the model output and are shown for illustrative purposes only.
  • Figure 5: Real-Time Interactive Multiplayer Generative Experiences. A shared consistent world state (left) enables consistent multiplayer generative experiences (right). Our method leverages a diffusion model conditioned on past frame observations, the next player action, and the external world state to generate gameplay rollouts in real time. The shared world state enables meaningful interactions between players, such as one player killing another (right).
  • ...and 1 more figures
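The shared-state bookkeeping described in the Figure 4 caption (a killed player is removed from the shared state and re-added on respawn) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: the `SharedWorld` class, its method names, and the relative-offset readout are all hypothetical.

```python
class SharedWorld:
    """Toy shared world state: one entry per live player (hypothetical design)."""

    def __init__(self):
        self.positions = {}  # player id -> (x, y)

    def spawn(self, player, position):
        """Add (or re-add after respawn) a player to the shared state."""
        self.positions[player] = position

    def kill(self, victim):
        """Remove a player from the shared state; no-op if already absent."""
        self.positions.pop(victim, None)

    def readout_for(self, player):
        """Offsets of every other live player relative to `player`.

        A per-player observation model could be conditioned on this readout,
        so that one player's actions change what the other player sees.
        """
        if player not in self.positions:
            return {}
        x, y = self.positions[player]
        return {
            other: (ox - x, oy - y)
            for other, (ox, oy) in self.positions.items()
            if other != player
        }


world = SharedWorld()
world.spawn("p1", (0, 0))
world.spawn("p2", (2, 1))
both_alive = world.readout_for("p1")     # p2 is visible to p1
world.kill("p2")
after_kill = world.readout_for("p1")     # p2 no longer in the shared state
world.spawn("p2", (5, 5))
after_respawn = world.readout_for("p1")  # p2 is visible again
```

Both players query the same object, so their readouts cannot disagree about who is alive or where; that single source of truth is the design choice the sketch is meant to highlight.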