MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang

TL;DR

MADiff addresses egocentric hand trajectory prediction (HTP) by fusing high-level semantic cues from a visual-language foundation model with a diffusion-based denoiser that incorporates camera egomotion through a motion-aware Mamba and motion-driven selective scan. The approach bridges autoregressive and iterative non-autoregressive paradigms and introduces a CDC operation to map continuous latents to discrete 2D hand waypoints, guided by directionality and stability losses. Empirical results across five public datasets show MADiff achieves state-of-the-art or competitive accuracy while maintaining real-time inference, and analyses reveal robustness to degenerate egomotion and gains from verb-specific textual prompts. The work advances practical HTP for AR/VR and robot manipulation by leveraging multimodal semantics and causality-aware diffusion dynamics.

Abstract

Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated by camera egomotion interference and by the absence of affordance labels to explicitly guide the optimization of the hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer's egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets, using both existing evaluation metrics and our newly proposed ones, demonstrate that MADiff predicts reasonable hand trajectories comparable to those of state-of-the-art baselines and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.

Paper Structure (29 sections, 11 equations, 18 figures, 8 tables)

Figures (18)

  • Figure 1: MADiff reconstructs future latents conditioned on past latents in the diffusion process. A Mamba-based model is designed to achieve motion-driven selective scan in the denoising process. The reconstructed future latent features are utilized to generate hand trajectory predictions.
  • Figure 2: System overview of MADiff. We use egocentric video clips, language description, and past 2D hand waypoints as inputs and design a Mamba-based and motion-driven denoising diffusion process to predict future 2D hand trajectories.
  • Figure 3: Visual-language fusion features extracted by GLIP from a video example of the EgoPAT3D-DT dataset [li2022egocentric, bao2023uncertainty] (average pooling over the channel dimension). GLIP attends to the target $\mathtt{hand}$ in the text prompt and to possible active objects, thereby extracting semantics with no need for affordance supervision. The deepest features align with the consistency in human intention, and can therefore be regarded as a high-level understanding of the interaction process. The sizes of the example feature maps from top to bottom (from shallow to deep in GLIP deep fusion) are $256\times 100\times 180$, $256\times 50\times 90$, $256\times 25\times 45$, $256\times 13\times 23$, and $256\times 7\times 12$.
  • Figure 4: Architecture of the fusion module in MADiff. It fuses semantic features from the foundation model with trajectory features from the trajectory encoder to generate tokens for the following diffusion model.
  • Figure 5: Start waypoint A and predicted end waypoint B are on the current image plane. Predicted waypoint C corresponds to the same 3D hand position as B but exists on the canvas image plane. The prediction model is empirically sensitive to the current displacement (A$\rightarrow$B), which needs to be shifted by an additional egomotion vector transformed from the homography matrix to get the end waypoint C on the canvas (A$\rightarrow$B$\rightarrow$C). We thus consider an additional feature update from the same homography matrix for state transition in the latent space, intuitively analogous to the shift in the 2D image space, as Eq. (\ref{eq:mdss1}) depicts.
  • ...and 13 more figures
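
The 2D shift described in the Figure 5 caption (A→B on the current image plane, then B→C onto the canvas image plane) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the paper's implementation: the homography values, the waypoint coordinates, and the helper name `warp_point` are all hypothetical, and the matrix is assumed to map points from the current image plane to the canvas image plane.

```python
import numpy as np

def warp_point(H, pt):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]  # divide by the projective scale to return to 2D

# Hypothetical homography (assumption): a pure translation of (12, -5) pixels
# from the current image plane to the canvas image plane.
H = np.array([[1.0, 0.0, 12.0],
              [0.0, 1.0, -5.0],
              [0.0, 0.0, 1.0]])

A = np.array([100.0, 80.0])   # start waypoint (made-up coordinates)
B = np.array([130.0, 95.0])   # predicted end waypoint on the current image plane

C = warp_point(H, B)          # same hand position expressed on the canvas plane
egomotion_shift = C - B       # extra vector induced by camera egomotion (B -> C)
total_displacement = C - A    # displacement along A -> B -> C

print("C on canvas:", C)
print("egomotion shift:", egomotion_shift)
print("total displacement:", total_displacement)
```

With a translation-only homography the egomotion shift is the same for every point; a general homography would give a shift that depends on where B lies in the image.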