Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy

TL;DR

This work tackles the scarcity of data for close human interactions in Human Mesh Estimation (HME) by using Large Vision Language Models (LVLMs) to annotate contact maps that guide test-time mesh optimization, producing the Ask Pose Unite (APU) dataset of 6209 image–mesh pairs. The authors introduce a scalable data-generation pipeline that combines LVLM outputs, 2D keypoints, and chirality cues to construct pseudo-ground-truth SMPL-XA meshes via a two-stage constrained optimization, then train a diffusion-based contact prior on FlickrCI3D, CHI3D, Hi4D, and the APU data for robust inference. A case study on NTU RGB+D 120 shows that this in-domain data improves the contact-prior representation, yielding higher accuracy, especially for uncommon interactions, and reducing sensitivity to contact-map errors. By releasing the APU dataset, a full datasheet, and the LVLM prompts, the paper offers a practical, scalable path to richer 3D understanding of close social interactions, with applications in social robotics and psychology research.

Abstract

Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. To address these challenges, we introduce a novel data generation method that uses Large Vision Language Models (LVLMs) to annotate contact maps, which guide test-time optimization to produce paired images and pseudo-ground-truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact and covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses the longstanding challenge of data scarcity for close interactions in HME, enhancing the field's ability to handle complex interaction scenarios.

Paper Structure

This paper contains 25 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Ask, Pose, Unite. We scale data acquisition for close interactions by Asking a Large Vision Language Model (LVLM) to identify contact points between people via language descriptions of the body parts that are touching. We Pose 3D meshes in the scene with predicted 2D keypoints and Unite the meshes in 3D by constraining an optimization of the mesh parameters with the predicted contacts. Through this data generation method we curate the Ask Pose Unite (APU) Human Mesh Estimation dataset for close interactions.
  • Figure 2: Distribution of interaction types. First two principal components of CLIP text embeddings of interaction names and grouped descriptions for existing datasets (CHI3D, Hi4D, FlickrCI3D, and ExPi) and our dataset. Point size indicates the number of examples. Our APU dataset contributes a wide range of interactions compared to existing datasets, increasing the diversity of both examples and types of interactions captured.
  • Figure 3: Examples of mesh pairs and images from our APU dataset obtained with our data generation method. Note the variety of subjects, ages, interactions, and settings.
  • Figure 4: Overview of our data generation method. From any set of images we obtain pairs of people in contact and their pseudo-ground truth meshes. For candidate pairs of people in contact, we query an LVLM for their contact maps and denoise the laterality of the contact maps via predicted 2D keypoint chirality. We then use the contacts to constrain the optimization of the mesh parameters and automatically filter out failure cases.
  • Figure S5: In-context example from the TV Interactions dataset provided with the prompt.
  • ...and 2 more figures
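
The idea in Figures 1 and 4 of constraining mesh optimization with predicted contacts can be illustrated with a minimal sketch. This is not the authors' implementation: the region names, the representation of each body region as a short list of 3D points, and the minimum-squared-distance penalty are all assumptions chosen for illustration. The sketch only shows that a contact map, read as a list of region pairs, yields a loss that shrinks as the paired regions are brought together.

```python
from itertools import product


def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def contact_loss(regions_a, regions_b, contacts):
    """Sum, over region pairs marked as touching, of the smallest
    squared distance between any point of one region and any point
    of the other. Hypothetical stand-in for a contact constraint."""
    total = 0.0
    for name_a, name_b in contacts:
        pairs = product(regions_a[name_a], regions_b[name_b])
        total += min(sq_dist(p, q) for p, q in pairs)
    return total


# Toy "meshes": each body region is a short list of 3D points (assumed layout).
person_a = {"right_hand": [(0.0, 1.0, 0.0), (0.1, 1.0, 0.0)]}
person_b = {"left_shoulder": [(0.5, 1.0, 0.0), (0.6, 1.1, 0.0)]}

# Contact map as (region on A, region on B) pairs, e.g. parsed from an LVLM answer.
contacts = [("right_hand", "left_shoulder")]

loss_before = contact_loss(person_a, person_b, contacts)

# Moving person B toward person A along x reduces the penalty.
moved_b = {
    name: [(x - 0.4, y, z) for x, y, z in pts]
    for name, pts in person_b.items()
}
loss_after = contact_loss(person_a, moved_b, contacts)

print(loss_before, loss_after)
```

In a real optimizer a term of this kind would be one of several, alongside 2D keypoint reprojection and pose priors, and would be minimized over mesh parameters rather than evaluated on fixed points.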