ContactArt: Learning 3D Interaction Priors for Category-level Articulated Object and Hand Poses Estimation

Zehao Zhu; Jiashun Wang; Yuzhe Qin; Deqing Sun; Varun Jampani; Xiaolong Wang

TL;DR

The paper tackles joint hand and category-level articulated object pose estimation by introducing ContactArt, a dataset collected via visual teleoperation in a simulator to obtain accurate hand and object poses and contact regions. It proposes two priors, a discriminator-based articulation prior and a diffusion-based contact map model, that are integrated into a single pipeline to improve 3D pose estimation and reduce the sim-to-real gap. Experiments on HOI4D, BMVC, and RBO show consistent improvements over state-of-the-art methods, and the ContactArt data serves as an effective warm start for transfer learning. Data collection requires only an iPhone to record hand motion, which makes it scalable, and the approach offers practical benefits for robotics and AR, where human-object interactions are common.

Abstract

We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation. We first collect a dataset using visual teleoperation, where a human operator directly manipulates articulated objects inside a physics simulator. We record the data and obtain free and accurate annotations of object poses and contact information from the simulator. Our system requires only an iPhone to record human hand motion, so it scales easily and greatly lowers the cost of data and annotation collection. With this data, we learn 3D interaction priors, including a discriminator (in a GAN) that captures the distribution of how object parts are arranged and a diffusion model that generates contact regions on articulated objects to guide hand pose estimation. Such structural and contact priors can easily transfer to real-world data with barely any domain gap. By using our data and learned priors, our method significantly improves performance on joint hand and articulated object pose estimation over existing state-of-the-art methods. The project is available at https://zehaozhu.github.io/ContactArt/ .
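
The idea of a contact map guiding hand pose estimation can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the predicted contact region is already available as a set of 3D points, and it optimizes only a global hand translation, whereas a full system would update articulated hand pose parameters. The function names `pull_hand_to_contacts` and `mean_sq_contact_distance` are hypothetical.

```python
import numpy as np

def mean_sq_contact_distance(hand_pts, contact_pts):
    """Mean squared distance from each hand point to its nearest contact point."""
    d2 = ((hand_pts[:, None, :] - contact_pts[None, :, :]) ** 2).sum(-1)  # (H, C)
    return float(d2.min(axis=1).mean())

def pull_hand_to_contacts(hand_pts, contact_pts, steps=200, lr=0.1):
    """Gradient descent on a global translation t that pulls hand points
    toward their nearest contact points (toy stand-in for pose refinement)."""
    t = np.zeros(3)
    for _ in range(steps):
        moved = hand_pts + t                                   # (H, 3)
        d2 = ((moved[:, None, :] - contact_pts[None, :, :]) ** 2).sum(-1)
        nearest = contact_pts[d2.argmin(axis=1)]               # (H, 3)
        # Gradient of the mean squared distance w.r.t. t,
        # holding the nearest-point assignment fixed for this step.
        grad = 2.0 * (moved - nearest).mean(axis=0)
        t -= lr * grad
    return t

# Hypothetical data: five "fingertip" points and a contact region
# that is the same point set shifted by a small offset.
hand = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.5, 0.0],
                 [0.0, 0.0, 0.5], [0.5, 0.5, 0.5]])
offset = np.array([0.05, -0.03, 0.02])
contacts = hand + offset
t = pull_hand_to_contacts(hand, contacts)
print("recovered translation:", t)
```

Because the nearest-point assignment is recomputed at every step, the loop behaves like a translation-only variant of iterative closest point; the contact map plays the role of the target the hand is attracted to.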

Paper Structure (16 sections, 9 equations, 7 figures, 7 tables)

Introduction
Related Work
ContactArt Dataset
Method
Object Pose Estimator
Articulation Discriminator
Contact Diffusion Model
Test Time Adaptation
Experiments
Datasets
Metrics and Methods for Comparison
Object Pose Estimation Comparison
Hand Pose Estimation Comparison
Generalization Comparison
Ablation Study
...and 1 more section

Figures (7)

Figure 1: Overview. We collect a dataset named ContactArt, created by humans interacting with articulated objects in a simulator via teleoperation. Two interaction priors are learned from ContactArt: (i) a contact prior predicted by a diffusion model to improve 3D hand pose estimation; (ii) an articulation prior with a discriminator to improve category-level articulated object pose estimation. We visualize pose estimation results on real-world data, leveraging the learned priors.
Figure 2: To collect ContactArt, the hardware requirements are an iPhone and a laptop. The system allows us to easily scale up the dataset without human annotation effort. We can collect manipulation sequences and render images from different camera views.
Figure 3: Training and testing framework. Left: During training, we first extract a point-wise feature and regress the part segmentation, NOCS map, and part-level rotation. We then compute the per-part 3D bounding box and feed it to a discriminator. We utilize a diffusion model conditioned on the point-wise feature to estimate the contact map. We visualize the contact points in green. $\bigoplus$ denotes concatenation. Right: At test time, we use the discriminator with fixed parameters to calculate an adversarial loss and backpropagate the gradients to update the object estimator. We then optimize the hand pose by "pulling" the hand closer to the contact points on the predicted contact map.
Figure 4: Qualitative comparison of object pose estimation. A red box indicates an error larger than 10$^\circ$ or 10 cm. Image-based baselines fail to recover an accurate pose, and our method also performs better than the tracking-based method CAPTRA (weng2021captra).
Figure 5: Optimization process. The hand moves toward the predicted contact map and converges to the correct pose.
...and 2 more figures
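
The 10 degree / 10 cm criterion in the Figure 4 caption can be made concrete with a minimal sketch. The exact metric definitions are not given in this excerpt, so the code assumes two common choices: the geodesic angle between rotation matrices and the Euclidean distance between translations expressed in meters. All function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in cm between translations given in meters (assumed unit)."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)) * 100.0)

def exceeds_threshold(R_pred, t_pred, R_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True if either the rotation or the translation error is above its threshold."""
    return (rotation_error_deg(R_pred, R_gt) > max_deg
            or translation_error_cm(t_pred, t_gt) > max_cm)

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Hypothetical prediction: 20 degrees off in rotation, 5 cm off in translation.
flag = exceeds_threshold(rot_z(20.0), [0.05, 0.0, 0.0], np.eye(3), np.zeros(3))
print("flag as error:", flag)
```

For objects with rotational symmetry, a real evaluation would typically need a symmetry-aware rotation error; this sketch ignores that case.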
