From Virtual Games to Real-World Play

Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, Yan Lu

TL;DR

RealPlay frames interactive real-world video generation as chunk-wise diffusion conditioned on user control signals. Training has two stages: first, a pre-trained image-to-video generator is adapted to produce short video chunks iteratively; second, the model is fine-tuned on a mix of labeled game data and unlabeled real-world videos, with action signals injected through adaptive modulation. The method transfers controls learned in a game to real-world entities (vehicles, bicycles, pedestrians) and outperforms both single-shot and prior chunk-wise baselines while keeping the generated video temporally coherent and realistic. The work is a step toward neural real-world game engines that learn realistic dynamics from data and reduce reliance on annotated real-world action data.

Abstract

We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. The project page is available at https://wenqsun.github.io/RealPlay/
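
The interactive loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RealPlay's implementation: `toy_generator`, `play`, and the list-of-floats `Chunk` type are hypothetical names, and the generator is a deterministic stub standing in for the chunk-wise diffusion model. The three action names follow the commands listed in the Figure 1 caption.

```python
from typing import Callable, List, Sequence

# Toy stand-in: a chunk is a list of frame latents, each a list of floats.
Chunk = List[List[float]]

# The three commands named in the Figure 1 caption.
ACTIONS = ("forward", "left", "right")


def toy_generator(prev_chunk: Chunk, action: str) -> Chunk:
    """Hypothetical stub for the chunk-wise video model.

    Shifts every latent value by an action-dependent offset so that the
    loop below has something deterministic to run on.
    """
    offset = {"forward": 0.0, "left": -1.0, "right": 1.0}[action]
    return [[value + offset for value in frame] for frame in prev_chunk]


def play(
    first_chunk: Chunk,
    actions: Sequence[str],
    generate: Callable[[Chunk, str], Chunk] = toy_generator,
) -> List[Chunk]:
    """Run the observe, command, receive-chunk loop once per action."""
    history = [first_chunk]
    for action in actions:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        # Each new chunk is conditioned on the previously generated chunk
        # and on the control signal the user issued for this iteration.
        history.append(generate(history[-1], action))
    return history
```

Calling `play([[0.0, 0.0]], ["right", "right", "left"])` returns the starting chunk plus one generated chunk per command. Conditioning only on the most recent chunk mirrors the iterative setup the abstract describes; a real system would swap `toy_generator` for the trained model.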

Paper Structure

This paper contains 13 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: RealPlay is a neural-network-driven real-world game engine with three key characteristics: (1) It supports iterative interaction—at each iteration, users observe the current visual scene, provide control signals, and receive control-accurate, temporally consistent, and realistic video chunks in response. (2) It eliminates the need for annotated real-world data while exhibiting strong control transfer capabilities, effectively mapping control signals (e.g., "move forward", "turn left" and "turn right") from the game environment to the real world. (3) It demonstrates entity transfer capabilities: although the labeled game data are sourced exclusively from the car racing game Forza Horizon 5, RealPlay successfully generalizes these control signals to other real-world entities such as (a) bicycles and (b) pedestrians, beyond (c) vehicles. Additional visualizations are provided in the appendix.
  • Figure 2: RealPlay involves a two-stage training process. Stage-1: We adapt a pre-trained image-to-video generator (Figure (a)), which generates an entire video in a single pass conditioned on a single frame, into a chunk-wise generation model (Figure (b)), which generates video chunks iteratively, conditioned on the previously generated chunk. This adaptation includes several key modifications detailed in the section on chunk-wise generation. Stage-2: RealPlay (Figure (c)) is trained on a combination of a labeled game dataset and an unlabeled real-world dataset, enabling action transfer from controlling a car in the game environment to manipulating various entities in the real world. This is achieved by modifying the chunk-wise generation model to incorporate action control through an adaptive LayerNorm mechanism. In all figures, "frames" refer to frame latents encoded by the video VAE encoder from CogVideoX (Yang et al., 2024). For clarity, we omit the details of injecting noise timestep embeddings.
  • Figure 3: Visual quality degrades in both game and real-world settings, but the image quality when controlling a game entity consistently remains higher than that of a real-world entity (e.g., the bicycle in this study), highlighting the greater challenge of modeling real-world entities.
  • Figure 4: Reducing the number of video latents per chunk leads to visual quality degradation, as the pre-trained video generator—originally optimized for long-horizon generation—loses temporal coherence and consistency when adapted to extremely short-horizon outputs (e.g., 1 latent).
  • Figure 5: Both the control success rate and Elo scores steadily improve as training progresses. The evaluation is performed on the bicycle entity.
  • ...and 8 more figures
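
The adaptive LayerNorm mechanism mentioned in the Figure 2 caption can be illustrated with a small sketch. The captions do not give the exact formulation, so this assumes the common convention of normalizing a feature vector and then applying an action-dependent scale and shift, `(1 + scale) * norm(x) + shift`. The names `layer_norm`, `adaptive_layer_norm`, and the `ACTION_MODULATION` table are hypothetical; in a trained model the scale and shift would come from a learned projection of an action embedding rather than a fixed table.

```python
import math
from typing import Dict, List, Tuple


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Hypothetical (scale, shift) pair per action. These values are made up
# for illustration; a real model would predict them from the action.
ACTION_MODULATION: Dict[str, Tuple[float, float]] = {
    "forward": (0.0, 0.0),
    "left": (0.5, -0.1),
    "right": (0.5, 0.1),
}


def adaptive_layer_norm(x: List[float], action: str) -> List[float]:
    """Apply LayerNorm, then modulate the result with the action's scale and shift."""
    scale, shift = ACTION_MODULATION[action]
    return [(1.0 + scale) * v + shift for v in layer_norm(x)]
```

With this table, `"forward"` leaves the normalized features unchanged, while `"left"` and `"right"` rescale them and shift them in opposite directions. The point of the design is that the control signal modulates existing features instead of being concatenated as an extra input.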