Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models
Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy
TL;DR
This work tackles data scarcity for close human interactions in Human Mesh Estimation (HME) by using Large Vision Language Models (LVLMs) to annotate contact maps and guide test-time mesh optimization, producing the Ask Pose Unite (APU) dataset of 6209 image–mesh pairs. The authors introduce a scalable data-generation pipeline that uses LVLM outputs, 2D keypoints, and chirality cues to construct pseudo-ground-truth SMPL-XA meshes via a two-stage constrained optimization, then train a diffusion-based contact prior on FlickrCI3D, CHI3D, Hi4D, and the APU data for robust inference. A case study on NTU RGB+D 120 shows that this in-domain data improves the contact-prior representation, yielding higher accuracy, especially for uncommon interactions, and reducing sensitivity to contact-map errors. By providing the APU dataset, a full datasheet, and the LVLM prompts, the paper offers a practical, scalable path to richer 3D understanding of close social interactions, with broad applicability in social robotics and psychology research.
Abstract
Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. To address these challenges, we introduce a novel data generation method that uses Large Vision Language Models (LVLMs) to annotate contact maps, which guide test-time optimization to produce paired images and pseudo-ground-truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses the longstanding challenge of data scarcity for close interactions in HME, enhancing the field's capability to handle complex interaction scenarios.
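To make the idea of a contact map guiding optimization concrete, here is a minimal sketch of one possible contact term. It is not the authors' implementation. It assumes each body is summarized by a set of 3D region centroids and that the contact map is a binary matrix whose entry (i, j) marks region i of one person as touching region j of the other. The function name `contact_loss`, the centroid representation, and the region count used in the example are all hypothetical.

```python
import numpy as np


def contact_loss(regions_a, regions_b, contact_map):
    """Mean squared distance over region pairs flagged as in contact.

    regions_a, regions_b: (R, 3) arrays of region centroids (assumed representation).
    contact_map: (R, R) binary array; entry (i, j) = 1 means region i of
    person A is annotated as touching region j of person B.
    """
    # Pairwise differences between every region of A and every region of B: (R, R, 3).
    diff = regions_a[:, None, :] - regions_b[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    mask = contact_map.astype(bool)
    if not mask.any():
        # No annotated contacts, so this term contributes nothing.
        return 0.0
    return float(sq_dist[mask].mean())


# Hypothetical example with 4 regions per person.
a = np.zeros((4, 3))
b = np.ones((4, 3))
cmap = np.zeros((4, 4))
cmap[0, 1] = 1  # region 0 of A touches region 1 of B
print(contact_loss(a, b, cmap))  # 3.0
```

In an optimization loop, a term of this kind would be minimized alongside other objectives (for example 2D keypoint reprojection), so that regions annotated as touching are pulled together in 3D.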
