
UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong, Zuxuan Wu

TL;DR

UniHand is a unified diffusion-based framework that formulates both hand motion estimation and generation as conditional motion synthesis. It delivers robust and accurate hand motion modeling and maintains performance under severe occlusions and temporally incomplete inputs.

Abstract

Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
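To make the idea of treating estimation and generation as one conditional synthesis problem more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each frame's available conditions have already been encoded to fixed-width latent vectors, and it packs them into a fixed slot layout with a presence mask so that occluded or missing frames are simply masked out. The modality names, `LATENT_DIM`, `pack_frame`, and `pack_sequence` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical modality slots and latent width; neither comes from the paper.
MODALITIES = ("image", "mano", "skeleton_2d")
LATENT_DIM = 4


@dataclass
class FrameCondition:
    tokens: List[List[float]]  # one latent vector per modality slot
    mask: List[int]            # 1 if that modality was observed, else 0


def pack_frame(observed: Dict[str, Sequence[float]]) -> FrameCondition:
    """Place the observed latents into fixed slots; zero-fill missing ones."""
    tokens: List[List[float]] = []
    mask: List[int] = []
    for name in MODALITIES:
        vec = observed.get(name)
        if vec is None:
            tokens.append([0.0] * LATENT_DIM)
            mask.append(0)
        else:
            if len(vec) != LATENT_DIM:
                raise ValueError(f"{name}: expected {LATENT_DIM} values")
            tokens.append(list(vec))
            mask.append(1)
    return FrameCondition(tokens, mask)


def pack_sequence(frames: Sequence[Dict[str, Sequence[float]]]) -> List[FrameCondition]:
    """Pack a whole clip; an empty dict marks a frame with no observation."""
    return [pack_frame(frame) for frame in frames]


# Estimation-style input: image latents, with the middle frame fully occluded.
est = pack_sequence([{"image": [0.1] * 4}, {}, {"image": [0.3] * 4}])

# Generation-style input: a 2D skeleton on the first frame only (infilling).
gen = pack_sequence([{"skeleton_2d": [1.0] * 4}, {}])
```

Under these assumptions, both inputs share one layout, so a single conditional model could in principle consume either; only the mask pattern differs between the estimation-style and generation-style cases.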

Paper Structure (47 sections, 17 equations, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: Overview of the UniHand framework. (I) The Joint VAE aligns motion and condition encoders within a shared latent space. An autoregressive decoder iteratively reconstructs motion to preserve temporal consistency. (II) The latent diffusion model is trained on this latent space, where multimodal conditions are fused, and hand-relevant vision tokens are integrated into the denoiser.
  • Figure 2: Visualization of generated hand poses and trajectories. The first example shows a static camera scenario where the subject picks up a red bowl, with significant hand occlusion. The second example is recorded with a dynamic camera, where the subject picks up and manipulates a magic cube, involving large hand movements. UniHand produces more accurate hand motion by modeling motions in a canonical coordinate space, even without relying on explicit camera extrinsics.
  • Figure 3: Illustration of hand occlusion level computation on the DexYCB dataset.
  • Figure 4: Additional visualization of generated hand poses and trajectories.
  • Figure 5: Qualitative comparison between HaMeR and our UniHand. Our method generates more continuous and accurate hand pose sequences compared to HaMeR.
  • ...and 2 more figures