ActionParty: Multi-Subject Action Binding in Generative Video Games

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin

Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
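The core mechanism, concatenating per-subject state tokens with the video latents along the sequence dimension and denoising both jointly, can be pictured with a short sketch. This is a minimal illustration under assumed names and shapes (`JointDenoiser`, `d_model`, a vanilla transformer encoder standing in for the DiT), not the paper's implementation:

```python
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Toy sketch of joint denoising: video latents and per-subject
    state tokens are concatenated along the sequence dimension and
    processed together. All names and shapes are assumptions."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_subjects: int = 7):
        super().__init__()
        # One persistent state token per subject (the paper controls up to 7).
        self.init_states = nn.Parameter(torch.randn(n_subjects, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, video_latents: torch.Tensor):
        # video_latents: (batch, n_video_tokens, d_model)
        b, n_video, _ = video_latents.shape
        states = self.init_states.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([video_latents, states], dim=1)  # joint sequence
        tokens = self.blocks(tokens)
        # Split back into denoised video tokens and updated subject states.
        return tokens[:, :n_video], tokens[:, n_video:]
```

A plain encoder stands in for the DiT here; the paper's masked attention and 3D RoPE biasing (Figure 3) would replace the default full attention.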


Figures (7)

  • Figure 1: Left: Action binding failure case of text-to-video models with prompt: 'The red triangle moves right and the blue square moves up. Then the red triangle moves down and the blue square moves left. Then the red triangle moves up and the blue square moves right. Then the red triangle moves left and the blue square moves down.' Right: ActionParty enables action control of multiple subjects in a scene.
  • Figure 2: ActionParty pipeline. Given initial video frames $x_{0:t}$ and subject states $z_{0:t}$ as context, we aim to generate the next video frame $x_{t+1}$ conditioned on action inputs $a_{0:t}$ and a text description of the game $c$. We concatenate the video and subject state tokens along the sequence dimension and feed them into a diffusion transformer (DiT) for joint denoising. Each DiT block first runs self-attention with an attention mask $\mathcal{M}_{SA}$ and 3D RoPE biasing to render subject states to pixels in the video. It then runs cross-attention with a second attention mask $\mathcal{M}_{CA}$ that enforces explicit subject-action binding, updating subject states from the action inputs.
  • Figure 3: Attention mechanisms in ActionParty DiT. (a) In self-attention, we use RoPE to link a subject in a video frame to its state token $z_i^t$. We encode the state token with the subject's coordinates from the previous timestep, biasing it to attend to video tokens close to the subject. (b) In cross-attention, subject $i$'s state token $z^i$ is only allowed to attend to its own actions $a^i$, ensuring correct subject-action binding. We also allow the text embedding of the environment description $c$ to attend to video tokens $x$. (Illustrative sketches of both mechanisms follow the figure list.)
  • Figure 4: Qualitative comparison with baselines. At each step we overlay an arrow on every subject showing its ground-truth position and orientation; the arrows are identical across methods. Our method is the only one that follows the ground-truth actions with correct action binding.
  • Figure 5: Movement accuracy (MA) over autoregressive steps. ActionParty maintains stable action binding across multiple rollout steps, whereas baselines degrade over time and approach 0.
  • ...and 2 more figures
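
For the spatial biasing in Figure 3(a), the key idea is that a state token is rotary-encoded at its subject's previous coordinates, so it shares a phase with nearby video tokens and naturally attends to them. Below is a toy 2D rotary embedding illustrating this; the paper uses 3D RoPE, and `rope_2d` and its channel layout are assumptions:

```python
import torch

def rope_2d(x: torch.Tensor, coords: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Toy 2D rotary embedding: the first half of the channels is rotated
    by the x-coordinate, the second half by the y-coordinate.

    x: (n_tokens, d) with d divisible by 4; coords: (n_tokens, 2).
    A simplified stand-in for the 3D RoPE biasing in Figure 3(a)."""
    half = x.shape[-1] // 2
    out = []
    for axis in range(2):
        xa = x[..., axis * half : (axis + 1) * half]
        pos = coords[:, axis : axis + 1].to(x.dtype)             # (n, 1)
        freqs = base ** (-torch.arange(0, half, 2, dtype=x.dtype) / half)
        angles = pos * freqs                                      # (n, half/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = xa[..., 0::2], xa[..., 1::2]                     # rotate pairs
        out.append(torch.stack([x1 * cos - x2 * sin,
                                x1 * sin + x2 * cos], dim=-1).flatten(-2))
    return torch.cat(out, dim=-1)

# Applying rope_2d to video tokens at their grid coordinates and to a state
# token at its subject's last position gives the state token the same phase
# as video tokens near that position, biasing attention toward them.
```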
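
The subject-action binding mask $\mathcal{M}_{CA}$ in Figure 3(b) is essentially block-diagonal: subject $i$'s state token may only attend to subject $i$'s action tokens. A minimal construction sketch, assuming one state token per subject and a contiguous block of action tokens per subject (a hypothetical layout, not necessarily the paper's):

```python
import torch

def build_binding_mask(n_subjects: int, actions_per_subject: int) -> torch.Tensor:
    """Boolean cross-attention mask (True = attention allowed): state
    token i may attend only to subject i's own action tokens.

    Assumes one state token per subject and contiguous per-subject
    action blocks; both are illustrative assumptions."""
    n_actions = n_subjects * actions_per_subject
    mask = torch.zeros(n_subjects, n_actions, dtype=torch.bool)
    for i in range(n_subjects):
        start = i * actions_per_subject
        mask[i, start : start + actions_per_subject] = True
    return mask

# Example: 3 subjects with 4 action tokens each -> a 3x12 block mask.
print(build_binding_mask(3, 4).int())
```

Such a boolean mask can then be converted to the attention implementation's convention (e.g., additive $-\infty$ on disallowed positions) to zero out cross-subject attention.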