Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein

TL;DR

This work introduces a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses, and proposes an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions.

Abstract

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.

Paper Structure (38 sections, 8 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Generated reality is a concept that incorporates human-tracked data (left) into an autoregressive video generation model to enable immersive experiences (right). These generated virtual environments do not rely on laboriously designed 3D assets but are created in a zero-shot manner by the video generator. We explore diffusion transformer conditioning strategies for joint-level hand and head poses, identifying a hybrid 2D--3D strategy as the most effective approach. Our bidirectional attention-based video generator is distilled into a few-step autoregressive model, enabling interactive, human-centric experiences supporting dexterous hand--object interactions.
  • Figure 2: Diverse generations. Leveraging the implicit world knowledge of foundation video models, our system generalizes to diverse scenarios with complex interactions. Generated videos (top) are visualized with input hand conditioning overlaid. Note that, consistent with the pretraining data, the text prompts (below) are augmented with an LLM before being passed to the model.
  • Figure 3: Pipeline of generated reality system. We track the head and hand poses of the user with a commercial headset. Hands are represented using the UmeTrack hand model, which includes translation and rotation of the wrist as well as rotation angles for 20 finger joints per hand. Our conditioning strategy employs a hybrid 2D--3D mechanism, combining a 2D image of the rendered hand skeleton (purple box, bottom) and the 3D model parameters (purple box, top). Features extracted from these modules are combined with the head pose features via token addition and fed into the diffusion transformer (DiT). The diffusion model autoregressively generates new frames at time $t$ using the last few generated frames as context in addition to the user-tracked conditioning signals.
  • Figure 4: Qualitative comparison of hand-pose conditioning strategies. Ground-truth conditioning hand input is shown in red. Predicted hands are orange; overlap is green. Our hybrid conditioning strategy is most accurate among these baselines, especially when hands are partly occluded at the boundaries of the frame.
  • Figure 5: Qualitative comparison of joint hand--camera control. Ground-truth (GT), camera-only, hand-only, and joint-control results. Camera-Ctrl and Hand-Ctrl are effective at controlling one of these modalities but not the other. Our Joint-Ctrl mechanism enables simultaneous control of camera and hands.
  • ...and 5 more figures
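
The Figure 3 caption describes combining hand features (from the 2D skeleton image and the 3D model parameters) with head pose features via token addition. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. Everything named here is hypothetical: the parameter layout (axis-angle rotations, 26 values per hand, 6 for the head), the random matrices standing in for learned projections, the token width, and the choice to sum the features directly into one frame's tokens.

```python
import random

# Assumed layout, not from the paper: per hand, wrist translation (3) +
# wrist rotation as axis-angle (3) + 20 finger-joint angles = 26 values.
HAND_DIM = 2 * (3 + 3 + 20)
# Assumed head pose layout: translation (3) + axis-angle rotation (3).
HEAD_DIM = 6


def random_projection(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned linear layer (in_dim x out_dim)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vec, weight):
    """Map a length-in_dim vector to a length-out_dim feature vector."""
    out_dim = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(out_dim)]


def add_conditioning(frame_tokens, skeleton_tokens, hand_params, head_pose,
                     w_hand, w_head):
    """Sum conditioning features into the tokens of one frame.

    frame_tokens, skeleton_tokens: lists of N tokens, each a length-D list.
    The per-frame 3D hand and head features are broadcast over all N tokens.
    """
    hand_feat = project(hand_params, w_hand)
    head_feat = project(head_pose, w_head)
    return [
        [t + s + h + c for t, s, h, c in zip(tok, skel, hand_feat, head_feat)]
        for tok, skel in zip(frame_tokens, skeleton_tokens)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, dim = 4, 8  # illustrative sizes only
    frame = [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]
    skeleton = [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]
    out = add_conditioning(
        frame, skeleton,
        hand_params=[0.0] * HAND_DIM, head_pose=[0.0] * HEAD_DIM,
        w_hand=random_projection(HAND_DIM, dim, rng),
        w_head=random_projection(HEAD_DIM, dim, rng),
    )
    print(len(out), len(out[0]))
```

Addition keeps the token count unchanged, so in this sketch the conditioning adds no sequence length for the transformer to attend over; whether the actual system applies the sum at the input or inside the DiT blocks is not stated in the caption.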