Dexterous World Models

Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo

TL;DR

Dexterous World Models (DWM) propose a scene-action-conditioned video diffusion framework to model how dexterous hand actions induce dynamic changes in static 3D scenes. By conditioning on a static scene render along a camera path and egocentric hand mesh trajectories, DWM learns residual, action-driven dynamics while preserving the unaltered scene content. A hybrid training dataset combines synthetic, aligned triplets with fixed-camera real-world videos to provide strong supervision and realistic dynamics; the model is initialized with a pretrained inpainting diffusion prior and trained in a latent VAE space. Experiments show realistic, physically plausible interactions, generalization to unseen real-world scenes, and utility for simulation-based action evaluation, offering a foundation for embodied, interactive digital twins.

Abstract

Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.
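The two conditioning signals described above can be illustrated with a minimal sketch. It assumes, as is common for inpainting-style latent diffusion models, that the noisy video latent, the encoded static scene rendering, and the encoded hand mesh rendering share one shape and are joined along the channel axis before denoising. The abstract does not confirm this wiring. The function name `build_denoiser_input`, the `(T, C, H, W)` layout, and the zero-filled stand-in for a missing hand condition are all hypothetical choices made for illustration.

```python
import numpy as np


def build_denoiser_input(noisy_latent, scene_latent, hand_latent=None):
    """Join latents along the channel axis (assumed layout: T, C, H, W).

    Hypothetical helper: the actual conditioning mechanism of DWM is not
    specified in the text above.
    """
    if hand_latent is None:
        # Assumption: an all-zero latent stands in for "no hand motion",
        # mimicking a navigation-only rollout.
        hand_latent = np.zeros_like(scene_latent)
    if not (noisy_latent.shape == scene_latent.shape == hand_latent.shape):
        raise ValueError("all latents must share the same (T, C, H, W) shape")
    # Channel-wise concatenation: [noisy video | static scene | hand mesh].
    return np.concatenate([noisy_latent, scene_latent, hand_latent], axis=1)


# Toy latents: 4 frames, 8 channels, 6x6 spatial resolution.
rng = np.random.default_rng(0)
T, C, H, W = 4, 8, 6, 6
noisy = rng.standard_normal((T, C, H, W))
scene = rng.standard_normal((T, C, H, W))
hand = rng.standard_normal((T, C, H, W))

x = build_denoiser_input(noisy, scene, hand)   # scene + hand conditioning
x_nav = build_denoiser_input(noisy, scene)     # scene conditioning only
print(x.shape, x_nav.shape)
```

Keeping the scene and hand conditions in separate channel groups is one way a single network could be driven with or without hand input. Whether DWM does this is not stated here.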

Paper Structure

This paper contains 26 sections, 11 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Dexterous World Models predict egocentric visual dynamics of static 3D scenes, driven by dexterous hand manipulations.
  • Figure 2: Overview. DWM simulates egocentric visual dynamics induced by embodied actions within a given static 3D scene. We instantiate it as a video diffusion model conditioned on the egocentric projections of the static scene and hand trajectories.
  • Figure 3: Qualitative comparison on synthetic and real-world scenes with dynamic view. DWM successfully generates physically plausible simulations with dynamic view changes corresponding to the input hand actions. Notably, our method generalizes well to completely unseen real-world scenes, producing coherent action-conditioned dynamics such as opening a sliding window.
  • Figure 4: Qualitative comparison on real-world scenes with static camera. Our method produces realistic interactions with consistent scene dynamics. Baselines fail to perform meaningful actions or hallucinate incorrect interactions.
  • Figure 5: Navigation-manipulation disentanglement. Without hand-motion input, DWM simulates navigation only. Conditioning on hand motion enables the model to generate action-induced visual dynamics, highlighting navigation-manipulation disentanglement.
  • ...and 12 more figures