Reconstructing Close Human Interactions from Multiple Views
Qing Shuai, Zhiyuan Yu, Zhize Zhou, Lixin Fan, Haijun Yang, Can Yang, Xiaowei Zhou
TL;DR
This work reconstructs the 3D poses of multiple people in close-range interactions from calibrated multi-view cameras. It introduces a learning-based pipeline built on a 3D conditional volumetric network that takes multi-view 2D keypoint heatmaps as input and outputs per-person 3D poses. Because the network is trained on synthetic data derived from MoCap, it requires no real image-3D pairs. A two-stage pose estimation network with heatmap supervision and anchor-guided volumes, combined with center estimation/tracking and an unpose transformation, keeps performance robust under heavy occlusion and close interactions. Experiments on CHI3D, Hi4D, and Panoptic show state-of-the-art accuracy and strong generalization across camera configurations and scene scales, with practical implications for animation, motion analysis, and novel view synthesis.
Abstract
This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras. The difficulty arises from three sources: noisy or false 2D keypoint detections caused by inter-person occlusion, heavy ambiguity in associating keypoints with individuals during close interactions, and the scarcity of training data, since collecting and annotating motion data in crowded scenes is resource-intensive. We introduce a novel system to address these challenges. Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies. The pose estimation component takes multi-view 2D keypoint heatmaps as input and reconstructs the pose of each individual using a 3D conditional volumetric network. As the network does not need images as input, we can leverage known camera parameters from test scenes and a large quantity of existing motion capture data to synthesize massive training data that mimics the real data distribution in test scenes. Extensive experiments demonstrate that our approach significantly surpasses previous approaches in terms of pose accuracy and generalizes across various camera setups and population sizes. The code is available on our project page: https://github.com/zju3dv/CloseMoCap.
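To make the data-synthesis idea in the abstract concrete, the following is a minimal sketch of how 3D joints from motion capture could be turned into multi-view 2D keypoint heatmaps using known camera parameters. It is not the authors' implementation: the function names (`project_points`, `render_heatmaps`, `synthesize_views`), the pinhole camera model without lens distortion, and the Gaussian width `sigma` are all assumptions made for illustration. The sketch also ignores occlusion and detector noise, which a synthesis meant to mimic real detections would need to model.

```python
import numpy as np

def project_points(joints_3d, K, R, t):
    """Project (J, 3) world-space joints to pixels with a pinhole camera (assumed model)."""
    cam = joints_3d @ R.T + t                 # world -> camera coordinates, shape (J, 3)
    depth = cam[:, 2]                         # depth along the optical axis, shape (J,)
    uvw = cam @ K.T                           # apply the 3x3 intrinsic matrix
    # Clamp the divisor only to avoid division by zero; joints with
    # non-positive depth are skipped later when rendering.
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-8)
    return uv, depth

def render_heatmaps(joints_2d, depth, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint; joints behind the camera stay all-zero."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(joints_2d), height, width), dtype=np.float32)
    for j, ((u, v), z) in enumerate(zip(joints_2d, depth)):
        if z <= 0:
            continue                          # not visible from this camera
        heatmaps[j] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    return heatmaps

def synthesize_views(joints_3d, cameras, height, width, sigma=2.0):
    """Stack heatmaps over all cameras: output shape (V, J, H, W).

    `cameras` is a list of (K, R, t) tuples, a hypothetical container format.
    """
    views = []
    for K, R, t in cameras:
        uv, depth = project_points(joints_3d, K, R, t)
        views.append(render_heatmaps(uv, depth, height, width, sigma))
    return np.stack(views)
```

Because the input to the pose network is heatmaps rather than images, a training pair in this sketch is simply the stacked heatmaps together with the 3D joints that produced them, which is why existing motion capture data can be reused without rendering realistic images.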
