MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos
Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang
TL;DR
MADiff addresses egocentric hand trajectory prediction (HTP) by fusing high-level semantic cues from a visual-language foundation model with a diffusion-based denoiser that incorporates camera egomotion through a motion-aware Mamba and motion-driven selective scan. The approach bridges autoregressive and iterative non-autoregressive paradigms and introduces a CDC operation to map continuous latents to discrete 2D hand waypoints, guided by directionality and stability losses. Empirical results across five public datasets show that MADiff achieves state-of-the-art or competitive accuracy while maintaining real-time inference, and analyses reveal robustness to degenerate egomotion and gains from verb-specific textual prompts. The work advances practical HTP for AR/VR and robot manipulation by leveraging multimodal semantics and causality-aware diffusion dynamics.
Abstract
Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated by camera egomotion interference and by the absence of affordance labels to explicitly guide the optimization of the hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. Denoising in the latent space is performed by our proposed motion-aware Mamba, which integrates the camera wearer's egomotion to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments on five public datasets, using both existing evaluation metrics and our newly proposed ones, demonstrate that MADiff predicts reasonable hand trajectories comparable to those of state-of-the-art baselines and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.
