The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation
Masashi Hatano, Zhifan Zhu, Hideo Saito, Dima Damen
TL;DR
This work tackles egocentric 3D hand forecasting, including scenarios where the hands are out of view. It introduces EgoH4, a diffusion-based transformer that jointly denoises hand and body joints conditioned on head pose, 2D hand locations, and image features. A visibility predictor and a 3D-to-2D reprojection loss enforce consistency with the image when the hands are in view. The approach is trained on 156K sequences and evaluated on 34K sequences from Ego-Exo4D, where it improves over the baseline by 3.4 cm in ADE for hand trajectory forecasting and by 5.1 cm in MPJPE for hand pose forecasting. The method advances proactive understanding of hand motion in realistic egocentric scenarios, with implications for AR/VR and human-robot interaction, where hands frequently move outside the camera frame.
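As an illustration of the 3D-to-2D reprojection idea, here is a minimal sketch, not the authors' implementation. It assumes a simple pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, an L1 pixel error, and a per-joint boolean visibility flag; the function names are invented for this example, and the paper's actual camera model and loss form may differ.

```python
def project_pinhole(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole assumption)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_loss(joints_3d, joints_2d, visible, fx, fy, cx, cy):
    """Mean L1 pixel error over joints flagged as visible.

    joints_3d: list of (x, y, z) predicted joints in camera coordinates
    joints_2d: list of (u, v) observed 2D joint locations in pixels
    visible:   list of bools; joints outside the view contribute nothing
    """
    total, count = 0.0, 0
    for p3, p2, vis in zip(joints_3d, joints_2d, visible):
        # Skip out-of-view joints and points behind the camera.
        if not vis or p3[2] <= 0:
            continue
        u, v = project_pinhole(p3, fx, fy, cx, cy)
        total += abs(u - p2[0]) + abs(v - p2[1])
        count += 1
    return total / count if count else 0.0


# Toy example: two visible joints, one out of view (ignored by the loss).
joints_3d = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, 1.0)]
joints_2d = [(50.0, 50.0), (100.0, 52.0), (0.0, 0.0)]
visible = [True, True, False]
loss = reprojection_loss(joints_3d, joints_2d, visible, 100.0, 100.0, 50.0, 50.0)
```

Masking by visibility means the 2D term only constrains joints that actually appear in the image, so out-of-view hands are left to the 3D objective.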
Abstract
Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when they are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes as input the observation sequence and camera poses, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints jointly, and add a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences. EgoH4 improves over the baseline by 3.4 cm in ADE for hand trajectory forecasting and by 5.1 cm in MPJPE for hand pose forecasting. Project page: https://masashi-hatano.github.io/EgoH4/
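For readers unfamiliar with the two reported metrics, the sketch below shows their commonly used definitions: ADE (average displacement error) as the mean Euclidean distance between predicted and ground-truth positions over the forecast horizon, and MPJPE (mean per-joint position error) as the mean Euclidean distance over all joints and frames. This is an assumption-labeled illustration; the abstract does not state details such as which point the trajectory tracks or whether poses are root-aligned, and the function names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ade(pred_traj, gt_traj):
    """Average displacement error over T future frames.

    pred_traj, gt_traj: lists of T (x, y, z) positions, e.g. one point per hand.
    """
    return sum(l2(p, g) for p, g in zip(pred_traj, gt_traj)) / len(gt_traj)


def mpjpe(pred_pose, gt_pose):
    """Mean per-joint position error over T frames and J joints.

    pred_pose, gt_pose: lists of T frames, each a list of J (x, y, z) joints.
    """
    errors = [
        l2(p, g)
        for pred_frame, gt_frame in zip(pred_pose, gt_pose)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(errors) / len(errors)


# Toy example in arbitrary units: one frame is off by 5, the other is exact.
traj_error = ade([(0, 0, 0), (0, 0, 0)], [(3, 4, 0), (0, 0, 0)])

# One frame with two joints: the first is off by 1, the second is exact.
pose_error = mpjpe([[(0, 0, 0), (1, 0, 0)]], [[(0, 0, 1), (1, 0, 0)]])
```

Both metrics are distances in the same units as the input coordinates, which is why the reported improvements are stated in centimetres.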
