Reconstructing Close Human Interactions from Multiple Views

Qing Shuai, Zhiyuan Yu, Zhize Zhou, Lixin Fan, Haijun Yang, Can Yang, Xiaowei Zhou

TL;DR

This work tackles reconstructing 3D poses of multiple people in close-range interactions from calibrated multi-view cameras. It introduces a learning-based pipeline that uses synthetic MoCap-derived data and a 3D conditional volumetric network, taking multi-view 2D keypoint heatmaps as input to produce per-person 3D poses without requiring real image-3D pairs. A two-stage pose estimation network with heatmap supervision and anchor-guided volumes, plus center estimation/tracking and an unpose transformation, enables robust performance under heavy occlusion and close interactions. Experiments on CHI3D, Hi4D, and Panoptic show state-of-the-art accuracy and strong generalization across camera configurations and scene scales, with practical implications for animation, motion analysis, and novel view synthesis.

Abstract

This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras. The difficulty arises from the noisy or false 2D keypoint detections due to inter-person occlusion, the heavy ambiguity in associating keypoints to individuals due to the close interactions, and the scarcity of training data as collecting and annotating motion data in crowded scenes is resource-intensive. We introduce a novel system to address these challenges. Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies. The pose estimation component takes multi-view 2D keypoint heatmaps as input and reconstructs the pose of each individual using a 3D conditional volumetric network. As the network doesn't need images as input, we can leverage known camera parameters from test scenes and a large quantity of existing motion capture data to synthesize massive training data that mimics the real data distribution in test scenes. Extensive experiments demonstrate that our approach significantly surpasses previous approaches in terms of pose accuracy and is generalizable across various camera setups and population sizes. The code is available on our project page: https://github.com/zju3dv/CloseMoCap.
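
The data synthesis idea in the abstract can be illustrated with a minimal sketch: given 3D joints from motion capture and a known camera, project the joints into the image and render one Gaussian heatmap per keypoint. This is not the authors' implementation. The function names, the distortion-free pinhole camera, the heatmap resolution, and the Gaussian width `sigma` are all assumptions made for illustration.

```python
import numpy as np

def project_points(P, X):
    """Project (J, 3) world points with a 3x4 camera matrix P; returns (J, 2) pixels."""
    Xh = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)  # homogeneous, (J, 4)
    x = Xh @ P.T                                                # (J, 3)
    return x[:, :2] / x[:, 2:3]                                 # perspective divide

def render_heatmaps(uv, height, width, sigma=2.0):
    """Render one Gaussian heatmap per 2D keypoint; returns (J, height, width)."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs[None] - uv[:, 0, None, None]) ** 2 + (ys[None] - uv[:, 1, None, None]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Toy camera (assumed values): focal length 100 px, 64x64 image, 4 m from the origin.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [4.0]])])
P = K @ Rt

# Two hypothetical MoCap joints in world coordinates (metres).
joints = np.array([[0.0, 0.0, 0.0], [0.4, -0.2, 0.0]])
uv = project_points(P, joints)
heatmaps = render_heatmaps(uv, 64, 64)
```

Repeating this for every calibrated camera of a test scene yields multi-view heatmaps paired with exact 3D ground truth, with no images involved. A realistic synthesizer would presumably also perturb the heatmaps to imitate detector noise and occlusion; that step is omitted here.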

Paper Structure (44 sections, 8 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Challenges in pose estimation with close proximity. This image highlights that when two individuals are in close proximity, it becomes difficult to obtain accurate 2D pose estimates due to heavy inter-person occlusion and keypoint association ambiguity. Moreover, in learning-based methods that directly regress 3D poses from feature volumes, the similarity in constructed volumes due to their spatial closeness complicates keypoint distinction for regression networks.
  • Figure 2: Diverse camera positions and orientations in various datasets. We show the camera positions and orientations of several common datasets: Panoptic (blue), CHI3D (black), and Hi4D (red). These datasets, collected by different studios, exhibit significant variations. In practical scenarios like a basketball court, scale differences become more apparent, posing challenges for network generalization across datasets.
  • Figure 3: Illustration of our method. For a multi-view scene, we first estimate (a) the 2D keypoint heatmaps of all people from input images. We then recover (b) the 3D centers of all people from these heatmaps. Following this, we construct keypoint feature volumes and anchor-guided feature volumes, which are subsequently fed through (c) the pose estimation network. The proposed network initially predicts the 3D heatmaps from the keypoint feature volumes and then utilizes these 3D heatmaps along with the anchor-guided feature volumes to generate (d) the 3D keypoints for each person. If the 3D keypoints from the previous time step are available, they can be used to filter the 3D heatmaps. Training the network does not require real image-3D keypoint pairs; it can be accomplished with synthetic data alone.
  • Figure 4: Coordinate transformation in 3D pose estimation. We apply an "unpose" operation to the estimated torso part (marked by the pink line in (a)), transforming it into a standard space and thus reducing the influence of global rotation.
  • Figure 5: Two-stage design. This image highlights the main difference between our approach and previous methods in the field. Given the feature volume obtained from multiple viewpoints (a), previous methods directly estimate the keypoint probability volume (b) of a target person. In contrast, we propose a two-stage method. The first stage, the Heatmap Estimation Module, focuses on identifying and filtering out the noise present in the input feature volume and outputs a cleaned response volume of all individuals (c). The second stage, the Keypoint Localization Module, leverages the cleaned response volume and the conditional inputs to acquire the desired keypoint probability volume for each individual. This two-stage design allows the network to gain a better understanding of the scene.
  • ...and 7 more figures
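
The keypoint feature volume mentioned in the Figure 3 caption can be sketched as follows: place a voxel grid around an estimated person center, project every voxel into each calibrated view, and sample the 2D heatmaps at the projected pixels. The captions do not say how views are aggregated or sampled, so the averaging across views, the nearest-pixel lookup, the grid size, and all names below are assumptions, not the authors' code.

```python
import numpy as np

def build_feature_volume(heatmaps, projections, center, size=2.0, res=16):
    """heatmaps: (V, J, H, W); projections: (V, 3, 4); returns (J, res, res, res)."""
    V, J, H, W = heatmaps.shape
    axis = np.linspace(-size / 2, size / 2, res)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + center   # voxel centers, (N, 3)
    grid_h = np.concatenate([grid, np.ones((grid.shape[0], 1))], axis=1)
    volume = np.zeros((J, grid.shape[0]))
    counts = np.zeros(grid.shape[0])
    for v in range(V):
        x = grid_h @ projections[v].T                 # project all voxels into view v
        front = x[:, 2] > 1e-6                        # keep voxels in front of the camera
        z = np.where(front, x[:, 2], 1.0)
        u = np.rint(x[:, 0] / z).astype(int)          # nearest-pixel column
        w = np.rint(x[:, 1] / z).astype(int)          # nearest-pixel row
        valid = front & (u >= 0) & (u < W) & (w >= 0) & (w < H)
        volume[:, valid] += heatmaps[v][:, w[valid], u[valid]]
        counts += valid
    volume /= np.maximum(counts, 1)                   # average over views that see the voxel
    return volume.reshape(J, res, res, res)

def gaussian_heatmap(uv, height, width, sigma=2.0):
    """Single Gaussian blob centered at pixel coordinates uv = (u, v)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - uv[0]) ** 2 + (ys - uv[1]) ** 2) / (2.0 * sigma ** 2))

# Toy scene (assumed values): a front camera and a side camera, one joint.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
R_front = np.eye(3)
R_side = np.array([[0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
t = np.array([[0.0], [0.0], [4.0]])
projections = np.stack([K @ np.hstack([R, t]) for R in (R_front, R_side)])

joint = np.array([0.2, -0.2, 0.4])
heatmaps = np.zeros((2, 1, 128, 128))
for v, P in enumerate(projections):
    x = P @ np.append(joint, 1.0)
    heatmaps[v, 0] = gaussian_heatmap(x[:2] / x[2], 128, 128)

volume = build_feature_volume(heatmaps, projections, center=np.zeros(3), size=2.0, res=11)
peak = np.unravel_index(volume[0].argmax(), volume[0].shape)  # voxel index of the strongest response
```

In this toy setup each view alone only constrains the joint to a ray, and the response becomes localized once both views are combined. When two people stand close together, their responses overlap inside the same volume, which is the ambiguity the Figure 1 caption describes.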