Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images

Aditya Prakash, David Forsyth, Saurabh Gupta

TL;DR

This work tackles long-horizon forecasting of bimanual 3D hand motion from a single RGB image. It introduces a two-stage approach: a lifting diffusion model $L$ that converts 2D hand keypoints into complete 3D MANO parameters, and a forecasting diffusion model $F$ that predicts future hand articulation conditioned on image features, handling multi-modality. Training leverages complete 3D labels in lab datasets for $F$ and uses $L$ to impute diverse 3D labels from abundant 2D annotations, enabling strong zero-shot generalization to everyday images like EgoExo4D. Experiments show diffusion-based forecasting outperforms regression baselines, lifting improves pseudo-label quality, and injecting 2D supervision boosts performance on in-the-wild data, with a significant 14% improvement from data diversity and 16.4% gain from the proposed forecasting approach compared to baselines.

Abstract

We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and effectiveness of our lifting (42% better) & forecasting (16.4% gain) models, over the best baselines, especially in zero-shot generalization to everyday images.

Paper Structure

This paper contains 11 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: ForeHand4D forecasts bimanual 3D hand motion from a single RGB image: (left) on 2 lab datasets (ARCTIC, H2O), (right) zero-shot forecasts on challenging EgoExo4D. The left hand is shown in pink, the right hand in blue. Color saturation decreases as time proceeds, i.e. later timesteps are denoted by lighter shades. We render the predicted motion on the input image & from another view. Our predictions span longer trajectories, are smoother & better placed in the scene compared to the baseline, especially on everyday images from EgoExo4D.
  • Figure 2: Overall Training Pipeline. We first use the 2D & 3D annotations in lab datasets to train a lifting diffusion model $L$ that maps 2D keypoint sequences to 3D MANO hands. We then run $L$ on diverse datasets with 2D annotations to generate 3D annotations. Finally, the forecasting model $F$ is trained on lab & diverse datasets with complete 3D supervision.
  • Figure 3: Architecture for Forecasting Model. We modify MDM [Tevet2023ICLR] to condition on image features extracted from a ViT backbone. Each input & output token is 198-dimensional: 2 hands $\times$ (16 joints $\times$ 6 (6D rotation for each joint) + 3 (wrist translation)).
  • Figure 4: Architecture for Lifting Model. We modify MDM [Tevet2023ICLR] to condition on a sequence of 2D hand keypoints & camera parameters. The conditioning module combines different input representations: 3D pose (rotation, translation) of the camera, Plücker rays [Zhang2024ICLR] & KPE [Prakash2023Ambiguity].
  • Figure 5: 4D hand predictions from the lifting model, which predicts 3D MANO parameters from 2D keypoints & camera parameters. We show 4 frames with the MANO mesh rendered onto the image for visualization (images are not used as input).
  • ...and 3 more figures
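
The 198-dimensional token in the Figure 3 caption works out to 2 hands × (16 joints × 6 + 3) = 2 × 99 = 198. The sketch below illustrates one way such a token could be packed and unpacked with NumPy. It is not the authors' code: the function names, the ordering of values inside the token (rotations first, then wrist translation, one hand after the other), and the column convention used when converting 6D rotations to matrices are all assumptions made for illustration.

```python
import numpy as np

# Sizes taken from the Figure 3 caption.
NUM_HANDS = 2
NUM_JOINTS = 16    # per-hand joint rotations, as stated in the caption
ROT_DIM = 6        # 6D rotation representation per joint
TRANS_DIM = 3      # wrist translation per hand
PER_HAND_DIM = NUM_JOINTS * ROT_DIM + TRANS_DIM   # 16 * 6 + 3 = 99
TOKEN_DIM = NUM_HANDS * PER_HAND_DIM              # 2 * 99 = 198


def pack_token(rot6d, trans):
    """Flatten per-hand rotations (2, 16, 6) and translations (2, 3) to (198,).

    The layout [rotations, translation] per hand is an assumed ordering.
    """
    rot6d = np.asarray(rot6d, dtype=np.float64)
    trans = np.asarray(trans, dtype=np.float64)
    assert rot6d.shape == (NUM_HANDS, NUM_JOINTS, ROT_DIM)
    assert trans.shape == (NUM_HANDS, TRANS_DIM)
    per_hand = np.concatenate([rot6d.reshape(NUM_HANDS, -1), trans], axis=1)
    return per_hand.reshape(-1)


def unpack_token(token):
    """Inverse of pack_token: (198,) -> rotations (2, 16, 6), translations (2, 3)."""
    per_hand = np.asarray(token, dtype=np.float64).reshape(NUM_HANDS, PER_HAND_DIM)
    split = NUM_JOINTS * ROT_DIM
    rot6d = per_hand[:, :split].reshape(NUM_HANDS, NUM_JOINTS, ROT_DIM)
    trans = per_hand[:, split:]
    return rot6d, trans


def rot6d_to_matrix(rot6d):
    """Convert 6D rotations (..., 6) to rotation matrices (..., 3, 3).

    Uses Gram-Schmidt orthonormalization of the two 3-vectors; treating them
    as the first two *columns* of the matrix is an assumed convention.
    """
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)
```

A forecast over T future timesteps would then be an array of shape (T, 198), with each row unpacked as above before being passed to a MANO layer.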