WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou

Abstract

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
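The geometric coupling described in the abstract can be illustrated with a small sketch: per-frame actions become twists in the Lie algebra se(3), the exponential map turns each twist into a relative camera motion, and the relative motions accumulate into a global pose that can then serve as a spatial index for retrieving past observations. This is not the authors' implementation. The function names (`se3_exp`, `integrate`, `pose_distance`, `retrieve`), the camera-to-world right-multiplication convention, the constant-velocity-per-frame assumption, and the distance score (translation distance plus weighted rotation angle) are all assumptions made for illustration; only the 20 FPS frame rate and the linear/angular velocity parameterization come from the figure captions.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def se3_exp(v, w):
    """Exponential map: twist (v, w) in se(3) -> 4x4 pose matrix in SE(3)."""
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    K = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]  # skew-symmetric hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:  # small-angle limits of the series coefficients
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = lambda i, j: 1.0 if i == j else 0.0
    R = [[eye(i, j) + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye(i, j) + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def integrate(actions, dt=0.05):
    """Accumulate per-frame relative motions into global poses (dt = 1/20 s)."""
    pose = se3_exp([0.0] * 3, [0.0] * 3)  # identity start pose
    poses = [pose]
    for v, w in actions:
        rel = se3_exp([x * dt for x in v], [x * dt for x in w])
        pose = matmul(pose, rel)  # assumed convention: right-multiply body-frame motion
        poses.append(pose)
    return poses

def pose_distance(p, q, rot_weight=1.0):
    """Hypothetical score: translation distance plus weighted rotation angle."""
    d_t = math.sqrt(sum((p[i][3] - q[i][3]) ** 2 for i in range(3)))
    trace = sum(p[k][i] * q[k][i] for i in range(3) for k in range(3))  # tr(Rp^T Rq)
    angle = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    return d_t + rot_weight * angle

def retrieve(memory, query, k=1):
    """Return indices of the k stored poses closest to the query pose."""
    return sorted(range(len(memory)), key=lambda i: pose_distance(memory[i], query))[:k]

# One second at 20 FPS: move forward at 1 unit/s while yawing one full turn.
actions = [((0.0, 0.0, 1.0), (0.0, 2.0 * math.pi, 0.0))] * 20
poses = integrate(actions)
# The trajectory closes a loop, so the final pose should index back to the first frame.
nearest = retrieve(poses[:-1], poses[-1])
```

In this toy trajectory the camera walks a full circle and returns to its starting pose, so a pose-indexed lookup recovers the earliest stored frame. This is the kind of revisit that the abstract's retrieval mechanism is meant to keep geometrically consistent.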

Paper Structure (46 sections, 13 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Teaser (Best viewed in color and zoomed in): WorldCam is an interactive 3D gaming model that enables precise action control under challenging keyboard and mouse inputs (top), supports long-horizon interactions (middle), and preserves consistent 3D geometry across viewpoints (bottom). Time (seconds at 20 FPS) is visualized in the top-left of each frame, while keyboard and mouse inputs are shown in the bottom-left and bottom-right, respectively. The red box highlights consistent 3D geometry in revisited views.
  • Figure 2: Overall architecture. WorldCam converts user actions into camera poses in Lie algebra and conditions a progressive autoregressive video transformer on these camera poses for precise action control. Retrieved long-term memory latents and camera poses from the memory pool enforce 3D consistency of the generated world, while short-term memory with an attention sink stabilizes long-horizon generation.
  • Figure 3: Dataset samples and statistics: (a) Example gameplay frames annotated with camera trajectories and text captions. (b) Distribution of video durations. (c) Distribution of linear velocities $(v_x, v_y, v_z)$. (d) Distribution of angular velocities $(\omega_x, \omega_y, \omega_z)$. The dataset captures diverse and authentic human gameplay behaviors for training interactive gaming world models.
  • Figure 4: Qualitative comparison with recent interactive gaming world models: Compared to prior works, WorldCam faithfully follows user actions and maintains coherent 3D scene structure with high visual fidelity over long horizons.
  • Figure 5: Qualitative Results (Best viewed in color and zoomed in): (a) WorldCam enables fine-grained action control coupled with simultaneous keyboard and mouse inputs. (b) WorldCam generates long-horizon videos exceeding 10 seconds (20 FPS) without error drift. Time (seconds at 20 FPS) is visualized in the top-left of each frame, while keyboard and mouse inputs are shown in the bottom-left and bottom-right, respectively.
  • ...and 2 more figures