How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions

Aditya Prakash, Benjamin Lundell, Dmitry Andreychuk, David Forsyth, Saurabh Gupta, Harpreet Sawhney

TL;DR

This work tackles the problem of predicting 3D hand motion and contact maps (interaction trajectories) from a single RGB image, action text, and a 3D contact point. It introduces LatentAct, a three-component framework consisting of an Interaction Codebook (InterCode) learned via a VQVAE, a Learned Indexer to map test inputs to codebook indices, and an Interaction Predictor (InterPred) based on a transformer to generate trajectories, all trained with a large-scale data engine derived from the HoloAssist dataset. The approach demonstrates strong generalization across novel objects, actions, tasks, and scenes, outperforming transformer and diffusion baselines in both forecasting and interpolation scenarios, and shows promising zero-shot transfer to ARCTIC. A key contribution is the latent codebook of interaction affordances, which enables robust hand pose and contact-map predictions and provides a scalable path toward integrating 3D object models in future work. The modular data pipeline and 2-stage training framework facilitate efficient learning of motion priors for diverse everyday interactions.

Abstract

We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.
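To make the "tokenizing" idea concrete, here is a minimal sketch of the nearest-neighbor lookup at the heart of a generic VQ-VAE: each continuous latent is replaced by its closest codebook entry, and the entry's index acts as the token. This is not the paper's implementation. The encoder, decoder, codebook size, latent dimension, and training losses are not given in this summary, so the 2-D toy codebook and the names `quantize` and `squared_distance` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent vector to the index and value of its nearest code."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Pick the codebook row with the smallest distance to this latent.
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy data (hypothetical): 3 codebook entries, 3 encoder outputs, 2-D each.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
print(indices)    # token ids, one per latent
print(quantized)  # the codebook vectors a decoder would consume
```

In this reading, the indexer described in the abstract would predict such indices from the image, text, and contact-point inputs at test time, since no ground-truth trajectory is available to encode.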

Paper Structure

This paper contains 12 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Interaction Trajectories. We tackle the novel task of predicting future 3D hand poses & contact maps, i.e., interaction trajectories, from a single image showing the object, text describing the action, and a 3D contact point on the object, in everyday activities. We show the trajectory predicted by our method (LatentAct) & ground truth (GT) for 3 future timesteps along with the contact point. We consider 2 settings: (top) Forecasting: single RGB view, action text & 3D contact points as input, (bottom) Interpolation: goal image is also provided.
  • Figure 2: We represent contact points as binary masks on the hand mesh vertices. The hand mesh is represented in the camera coordinate system, consisting of the local hand pose in the MANO (Romero et al., 2017) coordinate frame and a global transformation from the MANO frame to the camera frame. These hand poses and contact points over several timesteps form interaction trajectories.
  • Figure 3: Overview. Our framework involves a 2-stage training procedure: (left) Interaction Codebook: to learn a latent codebook of hand poses and contact points, i.e., tokenizing interaction trajectories, (right) a learned Indexer & an Interaction Predictor module to predict the interaction trajectories from a single image, action text & 3D contact point. We use pretrained features for images (from DeiT; Touvron et al., 2021) and text (from CLIP; Radford et al., 2021). The 3D contact point is input as a 3D Gaussian heatmap in a 3D voxel grid (omitted here for clarity).
  • Figure 4: Data engine. (top) Object masks are extracted using SAMv2 (Ravi et al., 2024), 3D hand poses & masks (2D rendering of mesh) are from HaMeR (Pavlakos et al., 2024), contact points are computed by projecting the 3D hand points into the 2D contact region (intersection of hand & object masks). (bottom) Generated object masks (highlighted in white), 3D hand mesh & contact points for 3 timesteps.
  • Figure 5: More training data helps both LatentAct & HCTFormer.
  • ...and 1 more figure
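The contact computation described in the Figure 4 caption (project 3D hand points into the image, keep those that land in the intersection of the hand and object masks) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: it assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, nearest-pixel rounding, and no occlusion or depth reasoning. All function names are made up for the example. The output is a per-vertex binary mask, matching the representation described in the Figure 2 caption.

```python
from typing import List, Sequence, Tuple

Mask = List[List[bool]]


def contact_region(hand_mask: Mask, object_mask: Mask) -> Mask:
    """2D contact region: pixels covered by both the hand and the object."""
    return [[h and o for h, o in zip(hand_row, obj_row)]
            for hand_row, obj_row in zip(hand_mask, object_mask)]


def project(point: Sequence[float], fx: float, fy: float,
            cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def contact_vertex_mask(vertices: Sequence[Sequence[float]],
                        hand_mask: Mask, object_mask: Mask,
                        fx: float, fy: float,
                        cx: float, cy: float) -> List[bool]:
    """Flag each vertex whose projection lands inside the contact region."""
    region = contact_region(hand_mask, object_mask)
    height, width = len(region), len(region[0])
    flags: List[bool] = []
    for vertex in vertices:
        if vertex[2] <= 0:  # behind the camera: cannot be projected
            flags.append(False)
            continue
        u, v = project(vertex, fx, fy, cx, cy)
        col, row = int(round(u)), int(round(v))
        inside = 0 <= row < height and 0 <= col < width
        flags.append(bool(inside and region[row][col]))
    return flags


# Toy 4x4 image (hypothetical): hand covers the centre, object the right half.
F, T = False, True
hand_mask = [[F, F, F, F], [F, T, T, F], [F, T, T, F], [F, F, F, F]]
object_mask = [[F, F, T, T] for _ in range(4)]
vertices = [(0.0, 0.0, 1.0), (-0.5, 0.0, 1.0), (0.0, -1.0, 2.0),
            (5.0, 0.0, 1.0), (0.0, 0.0, -1.0)]

flags = contact_vertex_mask(vertices, hand_mask, object_mask,
                            fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(flags)
```

A real pipeline would take the masks from a segmentation model and the vertices from a hand-mesh estimator. Here they are hand-written so the example runs on its own.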