Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images
Aditya Prakash, David Forsyth, Saurabh Gupta
TL;DR
This work tackles long-horizon forecasting of bimanual 3D hand motion and articulation from a single RGB image. It uses two diffusion models: a lifting model $L$ that converts 2D hand keypoint sequences into complete 3D MANO parameters, and a forecasting model $F$ that predicts future hand articulation conditioned on image features and, being generative, can represent the multimodal distribution of plausible futures. $F$ is trained on complete 3D labels from lab datasets together with diverse 3D labels that $L$ imputes from abundant 2D annotations, which enables strong zero-shot generalization to everyday images such as those in EgoExo4D. Experiments show that diffusion-based forecasting outperforms regression baselines, that lifting improves pseudo-label quality, and that injecting 2D supervision boosts performance on in-the-wild data: training on diverse data with imputed labels yields a 14% improvement, and the proposed forecasting approach a 16.4% gain over the best baselines.
Abstract
We tackle the problem of forecasting bimanual 3D hand motion and articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality of the hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and the effectiveness of our lifting (42% better) and forecasting (16.4% gain) models over the best baselines, especially in zero-shot generalization to everyday images.
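To make the phrase "diffusion loss" concrete, the following is a minimal sketch of a standard noise-prediction (DDPM-style) training objective applied to a future hand-motion tensor. It is an illustration only, not the authors' implementation: the linear noise schedule, the tensor shapes, and the names `make_alpha_bar`, `add_noise`, `diffusion_loss`, and `zero_denoiser` are all assumptions introduced here, and the denoiser is a placeholder standing in for a trained network conditioned on image features.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps


def diffusion_loss(denoiser, x0, cond, alpha_bar, rng):
    """Mean squared error between true and predicted noise at one random timestep."""
    t = int(rng.integers(0, len(alpha_bar)))
    x_t, eps = add_noise(x0, t, alpha_bar, rng)
    eps_hat = denoiser(x_t, t, cond)
    return float(np.mean((eps_hat - eps) ** 2))


# Hypothetical shapes: 30 future frames, 2 hands, 64 values per hand
# (an arbitrary placeholder, not the paper's parameterization).
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
future_motion = rng.standard_normal((30, 2, 64))
image_features = rng.standard_normal(256)


def zero_denoiser(x_t, t, cond):
    """Placeholder denoiser that always predicts zero noise."""
    return np.zeros_like(x_t)


loss = diffusion_loss(zero_denoiser, future_motion, image_features, alpha_bar, rng)
print(f"loss with placeholder denoiser: {loss:.3f}")
```

Because the model learns to denoise samples rather than regress a single target, drawing different noise at inference time can yield different plausible futures, which is the property a multimodal motion distribution calls for. A direct regression loss would instead tend to average over the modes.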
