Predicting 4D Hand Trajectory from Monocular Videos

Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black

TL;DR

HaPTIC predicts coherent 4D hand trajectories (3D hand pose and position over time) from monocular video by repurposing a strong image-based transformer (HaMeR) and adding two lightweight attention mechanisms: cross-view self-attention for temporal fusion and global-context cross-attention for scene-level context. It predicts trajectories directly in global coordinates by parameterizing the depth change $\Delta d_t$ and XY offsets, which avoids the weaknesses of the Weak2Full uplift. The method is trained on interleaved video and image data. It achieves state-of-the-art global trajectory accuracy on allocentric and egocentric datasets while preserving 2D pose alignment, and it generalizes to single-image hand pose estimation. HaPTIC supports hand-object interaction reasoning in AR/VR and robotics, offers a fast feed-forward alternative to optimization-based 4D reconstruction, and provides a strong initialization for subsequent refinement.

Abstract

We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation. Project website: https://judyye.github.io/haptic-www
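
To make the idea of fusing temporal information across frames concrete, here is a minimal sketch of attention applied along the frame axis. It is not the authors' layer: it assumes a single head, identity query/key/value projections, no residual connection, and that each token attends only to the token with the same index in every frame. The function name `cross_view_self_attention` and the nested-list layout `tokens[frame][token][channel]` are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_self_attention(tokens):
    """Fuse features across frames (illustrative assumption, not HaPTIC's exact layer).

    tokens[t][n] is the feature vector of token n in frame t.
    Each token attends to the same-index token in all frames,
    using scaled dot-product attention with identity projections.
    """
    num_frames = len(tokens)
    num_tokens = len(tokens[0])
    dim = len(tokens[0][0])
    fused_frames = []
    for t in range(num_frames):
        fused_tokens = []
        for n in range(num_tokens):
            query = tokens[t][n]
            keys = [tokens[s][n] for s in range(num_frames)]
            # Scaled dot-product score between this frame and every frame.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in keys
            ]
            weights = softmax(scores)
            # Weighted average of the per-frame features (values == keys here).
            fused = [
                sum(w * key[d] for w, key in zip(weights, keys))
                for d in range(dim)
            ]
            fused_tokens.append(fused)
        fused_frames.append(fused_tokens)
    return fused_frames


# Three frames, one token each, two channels (made-up numbers).
frames = [[[1.0, 0.0]], [[0.8, 0.2]], [[0.0, 1.0]]]
fused = cross_view_self_attention(frames)
```

The output has the same shape as the input, so a layer of this kind can be slotted between existing blocks of an image-based model without changing what the later blocks expect.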

Paper Structure (21 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Given a monocular video depicting hand motion, HaPTIC reconstructs both the hand pose (Pose Recon.) and the 4D hand trajectory in a consistent global coordinate frame (last 3 columns). The existing method produces convincing reprojections, but its 4D trajectory is implausible when seen from the side view. In contrast, our method generates faithful 4D trajectories. The opaque hand shows the reconstruction of the last frame, while semi-transparent hands visualize reconstructions from previous frames. Red curves visualize root trajectories.
  • Figure 2: Overall pipeline (left): HaPTIC extends the image-based model HaMeR. It takes in $M$ frames at a time and passes them through image towers that share weights. Each image tower outputs MANO parameters in local coordinates and trajectory parameters $d, u_{xy}$ that directly place the predicted local hand on the global 4D trajectory. Inside one image tower (right): The image tower is based on a transformer decoder. To each block, we add a cross-view self-attention layer (Cross-view SA) to fuse temporal information from other frames and a cross-attention layer (Global CA) that attends to features of the original frames. Orange indicates components that HaPTIC adds to HaMeR.
  • Figure 3: Toy example of the Weak2Full transformation: The predicted scales $s$ of the yellow and red hands are each only 6% off that of the blue hand. Their reprojections appear similar, yet the 3D positions induced by Weak2Full differ widely when the focal length is large.
  • Figure 4: Qualitative comparison. We compare our approach qualitatively with other feed-forward baselines on all three datasets. We show the first, middle, and last frames of an input video. We visualize the hand root trajectory as a red curve and five poses over time in a global coordinate frame viewed from the side. Poses from previous frames are visualized with transparency, while the last-frame reconstructions are opaque. We only visualize the right hand for clarity, but all methods can reconstruct both the left and right hands. We encourage readers to see the video results on our website.
  • Figure 5: Qualitative comparison of optimization: We compare results before and after optimization with trajectories initialized from baselines and HaPTIC. The optimized trajectories are smoother but optimization struggles to correct global error.
  • ...and 1 more figures
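
The Weak2Full toy example in Figure 3 can be reproduced numerically with a short sketch. It assumes the conversion commonly used in HMR-style pipelines, where a weak-perspective scale $s$ predicted on a square crop is lifted to a depth of $t_z = 2f / (s \cdot \text{crop})$. The page does not state this formula, so treat it as an assumption. The function names and all numeric values below are hypothetical.

```python
def weak_to_full_depth(scale, focal_length_px, crop_size_px):
    """Depth implied by a weak-perspective scale.

    Assumes the common HMR-style convention t_z = 2 * f / (s * crop);
    this formula is an assumption, not taken from the page above.
    """
    return 2.0 * focal_length_px / (scale * crop_size_px)


def depth_gap(scale_a, scale_b, focal_length_px, crop_size_px):
    """Absolute depth difference induced by two slightly different scales."""
    depth_a = weak_to_full_depth(scale_a, focal_length_px, crop_size_px)
    depth_b = weak_to_full_depth(scale_b, focal_length_px, crop_size_px)
    return abs(depth_a - depth_b)


# Illustrative values only: a reference scale and one that is 6% larger.
reference_scale = 1.0
perturbed_scale = 1.06
crop_size_px = 256

for focal_length_px in (500.0, 5000.0):
    gap = depth_gap(reference_scale, perturbed_scale, focal_length_px, crop_size_px)
    print(f"focal length {focal_length_px:7.1f} px -> depth gap {gap:.3f}")
```

Under this assumed formula the relative depth error depends only on the scale error, but the absolute gap grows linearly with the focal length. That is consistent with the caption's observation that near-identical reprojections can map to very different 3D positions when the focal length is large.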