Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang

TL;DR

Uni-Hand tackles egocentric hand motion forecasting with a universal, multi-modal framework that jointly predicts multi-dimensional hand trajectories and head motion. It employs a novel dual-branch diffusion model, with one branch for ego-motion forecasting (EMF) and one for hand-motion forecasting (HMF), together with a hybrid Mamba-Transformer denoiser and task-aware text embeddings. This design enables multi-target joint prediction and anticipation of hand-object interaction states. The method achieves state-of-the-art results across 2D and 3D benchmarks and shows practical downstream benefits in robotic manipulation, action anticipation, and action recognition, including real-robot policy transfer. The work also provides benchmarks and ablations that highlight the importance of modality fusion, explicit hand-head decoupling, and language-conditioned guidance for real-world applicability.

Abstract

Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate possible future hand waypoints, but they still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. We also enable Uni-Hand to predict hand-object interaction states (contact/separation) to better facilitate downstream tasks. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also demonstrates its impressive human-robot policy transfer, which enables robotic manipulation, and its effective feature enhancement for action anticipation/recognition.

Paper Structure

This paper contains 40 sections, 6 equations, 19 figures, 13 tables.

Figures (19)

  • Figure 1: Uni-Hand is a universal hand motion forecasting framework which facilitates multi-dimensional and multi-target predictions with multi-modal input. It also enables multi-task affordances for downstream applications.
  • Figure 2: System overview of Uni-Hand. Uni-Hand (a) converts multi-modal input into latent feature spaces, and (b) decouples predictions of future egomotion latents (EM latents) and hand motion latents (HM latents) by a novel dual diffusion. The vanilla Mamba (VM) is used for denoising in the ego-motion-forecasting diffusion (EMF diffusion). We further design a new denoising model in hand-motion-forecasting diffusion (HMF diffusion) with a hybrid Mamba-Transformer module (HMTM). The predicted HM latents are ultimately decoded to future hand trajectories and interaction states.
  • Figure 3: Architecture of the VL-fusion module. It generates HM latents for the following HMF diffusion by fusing vision-language features, waypoint features, and task-aware text embeddings.
  • Figure 4: Hand removal for purified point clouds. We regard the voxel patches encoded by the voxel encoder as 3D global context for the denoising process in the HMF diffusion.
  • Figure 5: Examples of head movement (corresponding to camera egomotion) and hand movement entangled during the hand-object interaction process in egocentric views in the EgoPAT3D dataset [li2022egocentric]. Here we present the RGB images and point clouds, as well as camera poses to clarify the hand-head motion trend.
  • ...and 14 more figures
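
The data flow described in the Figure 2 caption (two diffusion branches producing EM latents and HM latents, with the HM latents decoded to trajectories and interaction states) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a standard DDPM-style ancestral sampler, replaces the learned denoisers (vanilla Mamba and the hybrid Mamba-Transformer module) with a hypothetical stand-in function `toy_denoiser`, assumes the hand branch is conditioned on the predicted EM latents, and uses random linear decoders. All names, shapes, and the noise schedule are illustrative assumptions.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_denoiser(x, t, cond):
    """Hypothetical stand-in for a learned noise predictor.

    It treats the deviation of x from the condition as the predicted noise.
    A real model would be a trained network that also uses the step index t.
    """
    return x - cond


def reverse_diffusion(denoiser, cond, shape, num_steps, rng):
    """DDPM-style ancestral sampling with variance sigma_t^2 = beta_t."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def forecast(past_ego, past_hand, horizon, latent_dim, num_steps=50, seed=0):
    """Run both branches and decode the hand branch (illustrative only)."""
    rng = np.random.default_rng(seed)
    shape = (horizon, latent_dim)

    # Branch 1: future egomotion (EM) latents, conditioned on past egomotion features.
    em_latents = reverse_diffusion(toy_denoiser, past_ego, shape, num_steps, rng)

    # Branch 2: future hand motion (HM) latents. Conditioning on the predicted
    # EM latents is an assumption made for this sketch.
    hand_cond = 0.5 * (past_hand + em_latents)
    hm_latents = reverse_diffusion(toy_denoiser, hand_cond, shape, num_steps, rng)

    # Hypothetical linear decoders: 2D waypoints and a contact probability per step.
    w_traj = 0.1 * rng.standard_normal((latent_dim, 2))
    w_state = 0.1 * rng.standard_normal(latent_dim)
    waypoints_2d = hm_latents @ w_traj
    contact_prob = 1.0 / (1.0 + np.exp(-(hm_latents @ w_state)))
    return {"em_latents": em_latents, "waypoints_2d": waypoints_2d, "contact_prob": contact_prob}


if __name__ == "__main__":
    out = forecast(np.zeros(8), np.ones(8), horizon=5, latent_dim=8, num_steps=20)
    print(out["waypoints_2d"].shape, out["contact_prob"].shape)
```

Keeping the two branches as separate sampling calls mirrors the decoupling of head and hand motion described in the caption. Thresholding `contact_prob` would give a binary contact/separation state for each future step.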