Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models

Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, Elisa Ricci

TL;DR

This paper presents Promptable Game Models (PGMs), a data-driven framework that enables semantic, language-guided control of game dynamics and rendering. It combines a synthesis model based on a compositional NeRF for high-quality, controllable rendering with an animation model that uses a text-conditioned masked diffusion transformer to simulate complex game dynamics and game AI. The authors introduce two richly annotated datasets (Tennis and Minecraft) to support learning and evaluation, and demonstrate that PGMs outperform prior neural video game simulators in rendering quality and enable new capabilities such as director's mode and opponent modeling. This approach paves the way for accessible, low-cost game modeling and video editing, with potential impact on game development workflows and future AI-assisted simulation research.

Abstract

Neural video game simulators have emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment's state driven by the actions of its agents. While such a paradigm enables users to play a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with prompts specified as a set of natural language actions and desired states. The result, a Promptable Game Model (PGM), makes it possible for a user to play the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning "game AI", encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art. Our framework, data, and models are available at https://snap-research.github.io/promptable-game-models/.

Paper Structure (74 sections, 10 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: (a) Overview of our framework. The animation model produces states $\mathbf{s}$ from user-provided conditioning signals, or prompts, $\mathbf{s}^c, \mathbf{a}^c$; the synthesis model then renders these states. (b) The diffusion-based animation model predicts the noise $\boldsymbol{\epsilon}_k$ applied to the noisy states $\mathbf{s}^p_k$, conditioned on the known states $\mathbf{s}^c$ and actions $\mathbf{a}^c$ with their respective masks $\mathbf{m}^\mathbf{s}, \mathbf{m}^\mathbf{a}$, the diffusion step $k$, and the framerate $\nu$. The text encoder $\mathcal{T}$ produces embeddings for the textual actions, while the temporal model $\mathcal{A}$ performs noise prediction. (c) The synthesis model renders the current state using a composition of neural radiance fields, one for each object. A style encoder $\mathcal{E}$ extracts the appearance $\boldsymbol{\omega}$ of each object. Each object is represented in its canonical pose by $\mathcal{C}$, and deformations of articulated objects are modeled by the deformation model $\mathcal{D}$. After integration and composition, the feature grid $\mathbf{G}$ is rendered to the final image using the feature enhancer $\mathcal{F}$.
  • Figure 2: Different sequences predicted on the Tennis and Minecraft datasets starting from the same initial state and altering the text conditioning. Our model moves players and designates shot targets using domain-specific referential language (e.g. "right service box", "no man's land", "baseline"). The model supports fine-grained control over the various tennis shots using technical terms (e.g. "forehand", "backhand", "volley").
  • Figure 3: Sequences generated by specifying actions for one of the players and letting the model act as the game AI and take control of the opponent. The game AI successfully responds to the actions of the player by running to the right (see top sequence) or towards the net (see bottom sequence), following two challenging shots of the user-controlled player.
  • Figure 4: Sequences generated without any user conditioning signal. The actions of all players are controlled by the model that acts as the game AI. In tennis, the players produce a realistic exchange, with the bottom player advancing aggressively toward the net and the top player defeating him with a shot along the right sideline. The Minecraft player and tennis ball trajectories are highlighted for better visualization.
  • Figure 5: Given an initial and a final state, we generate all the states in between. We repeat the generation multiple times conditioning it using different actions indicating the desired intermediate waypoints.
  • ...and 12 more figures
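
The masked conditioning described in the Figure 1(b) caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes a standard DDPM-style forward process with a linear beta schedule, treats each state as a single scalar, and uses hypothetical helper names (`alpha_bar`, `noise_unknown_states`, `masked_noise_loss`). Entries with mask 1 play the role of the known states $\mathbf{s}^c$ and are passed through clean; entries with mask 0 are noised, and the noise-prediction loss is computed only on them.

```python
import math
import random

def alpha_bar(k, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction after k steps of a linear beta schedule (assumed)."""
    prod = 1.0
    for i in range(k):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod

def noise_unknown_states(states, mask, k, num_steps, rng):
    """Noise only entries with mask == 0; entries with mask == 1 stay clean.

    Returns (model_input, noise); noise is 0.0 at the conditioned entries.
    """
    a = alpha_bar(k, num_steps)
    model_input, noise = [], []
    for s, m in zip(states, mask):
        if m == 1:
            # Known state: used as conditioning, never noised.
            model_input.append(s)
            noise.append(0.0)
        else:
            # Unknown state: standard forward-diffusion sample at step k.
            eps = rng.gauss(0.0, 1.0)
            model_input.append(math.sqrt(a) * s + math.sqrt(1.0 - a) * eps)
            noise.append(eps)
    return model_input, noise

def masked_noise_loss(predicted_noise, noise, mask):
    """Mean squared error over the unknown (mask == 0) entries only."""
    errors = [(p - e) ** 2 for p, e, m in zip(predicted_noise, noise, mask) if m == 0]
    return sum(errors) / len(errors) if errors else 0.0

# Toy sequence of four scalar states; the first and last are given as prompts.
rng = random.Random(0)
states = [0.5, -1.0, 2.0, 0.0]
mask = [1, 0, 0, 1]
model_input, noise = noise_unknown_states(states, mask, k=10, num_steps=100, rng=rng)
```

In this toy setup, fixing the first and last states and noising the ones in between loosely mirrors the in-between generation of Figure 5; changing the mask alone switches between tasks without changing the model.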