Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma; Wentao Bao; Jingyi Xu; Guanzhong Sun; Yu Zheng; Erhang Zhang; Xieyuanli Chen; Hesheng Wang

TL;DR

Uni-Hand tackles egocentric hand motion forecasting with a universal, multi-modal framework that jointly predicts multi-dimensional hand trajectories and head motion. It employs a novel dual-branch diffusion model (EMF and HMF) with a hybrid Mamba-Transformer denoiser and task-aware text embeddings, enabling multi-target joint prediction and anticipation of hand-object interaction states. The method achieves state-of-the-art results across 2D/3D benchmarks and demonstrates practical downstream benefits in robotic manipulation, action anticipation, and action recognition, including real-robot policy transfer. The work also provides comprehensive benchmarks and ablations, highlighting the importance of modality fusion, explicit hand-head decoupling, and language-conditioned guidance for real-world applicability.

Abstract

Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Several hand trajectory prediction (HTP) methods have recently been developed to generate possible future hand waypoints, but they still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework that considers multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities through vision-language fusion, global context incorporation, and task-aware text embedding injection to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, in addition to the widely studied hand center points. We further enable Uni-Hand to predict hand-object interaction states (contact/separation) to better facilitate downstream tasks. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. Experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also demonstrates impressive human-robot policy transfer for robotic manipulation and effective feature enhancement for action anticipation/recognition.
