The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation

Masashi Hatano, Zhifan Zhu, Hideo Saito, Dima Damen

TL;DR

This work tackles egocentric 3D hand forecasting, including scenarios where the hands are out of view. It introduces EgoH4, a diffusion-based transformer that jointly denoises hand and body joints conditioned on head pose, 2D hand locations, and image features, with a visibility predictor and a 3D-to-2D reprojection loss to enforce consistency. The approach is evaluated on Ego-Exo4D, with 156K training and 34K testing sequences, and improves hand trajectory ADE by 3.4 cm and hand pose MPJPE by 5.1 cm over the baseline, with gains reported in both in-view and out-of-view settings. The method advances proactive understanding of hand motion in realistic egocentric scenarios, with implications for AR/VR and human-robot interaction, where hands may frequently move outside the camera frame.
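For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how ADE and MPJPE are commonly computed. It assumes the standard definitions (mean Euclidean distance over future timesteps for ADE, and over all joints and timesteps for MPJPE); the function names, the input layout, and the absence of any root or wrist alignment are assumptions for illustration, not details taken from the paper.

```python
import math


def ade(pred_traj, gt_traj):
    """Average Displacement Error over one trajectory.

    pred_traj, gt_traj: lists of T (x, y, z) positions, one per future timestep.
    Returns the mean Euclidean distance, in the same unit as the inputs.
    """
    if len(pred_traj) != len(gt_traj) or not pred_traj:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred_traj, gt_traj)) / len(pred_traj)


def mpjpe(pred_pose, gt_pose):
    """Mean Per-Joint Position Error over a pose sequence.

    pred_pose, gt_pose: lists of T frames, each a list of J (x, y, z) joints.
    Assumption: no root/wrist alignment is applied before measuring the error.
    """
    dists = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred_pose, gt_pose)
        for p, g in zip(pred_frame, gt_frame)
    ]
    if not dists:
        raise ValueError("pose sequences must contain at least one joint")
    return sum(dists) / len(dists)
```

With inputs expressed in metres, multiplying the result by 100 gives errors in centimetres, the unit in which the reported gains are stated.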

Abstract

Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when they are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We introduce a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes as input the observation sequence and camera poses, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints jointly, and add a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences. EgoH4 improves over the baseline by 3.4 cm in ADE for hand trajectory forecasting and by 5.1 cm in MPJPE for hand pose forecasting. Project page: https://masashi-hatano.github.io/EgoH4/
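The 3D-to-2D reprojection loss described in the abstract can be illustrated with a short sketch. It assumes a simple pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, 3D joints already expressed in the camera frame, and a mean Euclidean pixel error; the actual camera model, distance function, and weighting used by EgoH4 are not given in this section, so all of these are illustrative choices.

```python
import math


def project_pinhole(point_3d, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (pinhole assumption)."""
    x, y, z = point_3d
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_loss(joints_3d, joints_2d, visible, fx, fy, cx, cy):
    """Mean pixel error between projected 3D joints and observed 2D joints.

    Only joints flagged as visible contribute, mirroring the idea that the
    loss applies when hands are in view. Joints behind the camera (z <= 0)
    are skipped as well, since they cannot be projected meaningfully.
    Returns 0.0 when no joint contributes.
    """
    total, count = 0.0, 0
    for p3, p2, vis in zip(joints_3d, joints_2d, visible):
        if not vis or p3[2] <= 0:
            continue
        total += math.dist(project_pinhole(p3, fx, fy, cx, cy), p2)
        count += 1
    return total / count if count else 0.0
```

Masking by visibility keeps out-of-view joints from being penalized against 2D observations that do not exist, while the 3D denoising objective can still supervise them.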

Paper Structure

This paper contains 20 sections, 10 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Given signals during observation (camera poses, images, and visible hand locations in 2D), our proposed method EgoH4 forecasts future 3D hand pose. EgoH4 can forecast hand joints even when hands are out of view during observation. We show visible 2D hand positions overlaid on the observation frames $t_1$ and $t_2$, and the corresponding camera poses attached to the heads. At $t_2$, the right hand is invisible. In the forecasting frame, the right hand is back in view while the left hand is now out of view.
  • Figure 2: The framework of our proposed method, EgoH4. We show the denoising network in a single denoising step. During training, we estimate the original data $x_0$ from an arbitrary noise level $n$ to learn the denoising network. During inference, we iteratively denoise the noisy joints over the maximum diffusion step $N$ from $N$ to $0$.
  • Figure 3: Qualitative results for hand trajectory forecasting. We show sample qualitative results compared to our best-performing baseline across activities: cooking, COVID testing, basketball, and dance exercises. Red and green dots show the predicted future left and right hands, blue and purple dots show the ground-truth left and right hands, and orange dots show the predicted body joints at the last observable frame. For each track, darker colors indicate later times.
  • Figure 5: Per-timestep Hand Forecasting Accuracy. We report the hand trajectory forecasting accuracy in ADE and hand pose forecasting accuracy in MPJPE for every future timestep. Lines in blue and orange represent the performance of our model and the EgoEgoForecast baseline, respectively.
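The training and inference procedure summarized in the Figure 2 caption (estimate $x_0$ from a noise level $n$ during training, then denoise iteratively from $N$ to $0$ at inference) can be sketched on a toy vector as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the squared-error objective, and the deterministic DDIM-style update are swapped-in standard choices, and `denoiser` is a hypothetical callable standing in for the conditioned transformer.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fractions; index n is the noise level, alpha_bars[0] == 1."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def add_noise(x0, n, alpha_bars, rng):
    """Forward process: corrupt clean joints x0 to noise level n."""
    a = alpha_bars[n]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def x0_loss(denoiser, x0, n, alpha_bars, rng):
    """Training objective (assumed squared error): predict x0 from the noisy input."""
    x0_hat = denoiser(add_noise(x0, n, alpha_bars, rng), n)
    return sum((a - b) ** 2 for a, b in zip(x0_hat, x0)) / len(x0)


def sample(denoiser, dim, alpha_bars, rng):
    """Inference: start from Gaussian noise and denoise from level N down to 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for n in range(len(alpha_bars) - 1, 0, -1):
        x0_hat = denoiser(x, n)
        a_n, a_prev = alpha_bars[n], alpha_bars[n - 1]
        # Noise implied by the current sample and the x0 estimate.
        eps_hat = [(xi - math.sqrt(a_n) * x0i) / math.sqrt(1.0 - a_n)
                   for xi, x0i in zip(x, x0_hat)]
        # Deterministic (DDIM-style) move to the next lower noise level.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x
```

In this sketch a single random level `n` would be drawn per training example, whereas sampling visits every level once; conditioning signals such as head pose or image features would be passed to `denoiser` as extra arguments.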