3D Reconstruction of Interacting Multi-Person in Clothing from a Single Image

Junuk Cha, Hansol Lee, Jaewon Kim, Nhat Nguyen Bao Truong, Jae Shin Yoon, Seungryul Baek

TL;DR

This work presents a monocular pipeline for reconstructing complete, clothed multi-person geometry in scene space from a single image. It introduces two priors—a geometry prior via an implicit, detail-rich 3D generator and a surface-contact prior learned from images—to enable global pose refinement that prevents penetrations and enforces realistic interactions. The method achieves state-of-the-art results on occluded multi-person datasets, and ablations confirm the contribution of the contact-based refinement and enhanced geometry prior. The approach enables plausible AR/VR scene synthesis and consistent rendering of interacting people from a single view, with practical inference speeds on modern GPUs.

Abstract

This paper introduces a novel pipeline to reconstruct the geometry of multiple interacting people in clothing in a globally coherent scene space from a single image. The main challenge arises from occlusion: part of a human body is not visible from a single view because it is occluded by other people or by the body itself, which introduces missing geometry and physical implausibility (e.g., penetration). We overcome this challenge by utilizing two human priors, one for complete 3D geometry and one for surface contacts. For the geometry prior, an encoder learns to regress the image of a person with missing body parts to latent vectors; a decoder decodes these vectors to produce 3D features of the associated geometry; and an implicit network combines these features with a surface normal map to reconstruct complete and detailed 3D humans. For the contact prior, we develop an image-space contact detector that outputs a probability distribution of surface contacts between people in 3D. We use these priors to globally refine the body poses, enabling penetration-free and accurate reconstruction of multiple interacting people in clothing in the scene space. The results demonstrate that our method produces reconstructions that are more complete, globally coherent, and physically plausible than those of existing methods.
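To make the role of the contact prior concrete, the following is a minimal sketch of how a predicted contact distribution could steer a global pose refinement. It is not the paper's formulation: the region centroids, the function names `contact_loss` and `refine_translation`, the squared-distance objective, and the restriction to a single translation are all assumptions chosen for illustration. The sketch assumes the detector yields a matrix `prob` whose entry (i, j) is the probability that body region i of person A touches region j of person B.

```python
import numpy as np

def contact_loss(cent_a, cent_b, prob):
    """Probability-weighted squared distance between body-region centroids.

    cent_a: (Ra, 3) region centroids of person A (hypothetical input)
    cent_b: (Rb, 3) region centroids of person B (hypothetical input)
    prob:   (Ra, Rb) assumed contact probabilities between region pairs
    """
    diff = cent_a[:, None, :] - cent_b[None, :, :]   # (Ra, Rb, 3)
    sq_dist = np.sum(diff ** 2, axis=-1)             # (Ra, Rb)
    return float(np.sum(prob * sq_dist))

def refine_translation(cent_a, cent_b, prob, steps=200, lr=0.1):
    """Gradient descent on a global translation t applied to person B."""
    t = np.zeros(3)
    # Scale the step by the total contact mass so the update stays stable.
    step = lr / max(float(prob.sum()), 1e-8)
    for _ in range(steps):
        diff = cent_a[:, None, :] - (cent_b[None, :, :] + t)
        # d/dt of sum_ij prob_ij * ||a_i - b_j - t||^2
        grad = -2.0 * np.sum(prob[:, :, None] * diff, axis=(0, 1))
        t -= step * grad
    return t

# Toy usage: one region per person, predicted to be in contact.
cent_a = np.array([[0.0, 0.0, 0.0]])
cent_b = np.array([[1.0, 0.0, 0.0]])
prob = np.array([[1.0]])
t = refine_translation(cent_a, cent_b, prob)
```

In this toy case the translation pulls person B's region onto person A's. A full refinement would optimize body pose and shape parameters rather than a single translation, and would combine the contact term with the penetration and prior terms named in the ablation figure.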

Paper Structure (26 sections, 8 equations, 14 figures, 5 tables)


Figures (14)

  • Figure 1: Given a single RGB image of interacting multiple people with occlusion, we reconstruct their complete geometry. Our method estimates the contact signature between people, which provides a strong pose refinement cue in 3D to prevent penetration.
  • Figure 2: System overview. Given an image of interacting people, we aim to reconstruct multi-person geometry. Our pipeline is composed of three stages. (Top): In the generation stage, we extract coarse meshes for each person from the input image. (Bottom): In the contact estimation stage, we detect the regions of contact between individuals in the image. Finally, in the contact-based refinement stage, we generate detailed multi-person geometry by leveraging the information obtained from the previous two stages.
  • Figure 3: The details of the pipeline for each module in our system.
  • Figure 4: Ablation study on the normal map $\mathbf{N}$. (a) and (c) are the input RGB image and the input normal map, respectively. (b) and (d) show the results without and with the normal map, respectively. The normal map $\mathbf{N}$ provides more detailed surface information for reconstructing the mesh.
  • Figure 5: Ablation study on the elements of $L_\text{opt}$. The first row shows the results in front view, and the second row shows them in top view. (a) is the input RGB image. (b) shows the initial posed meshes. (c) shows the results without $L_\text{penet}$. (d) shows the results without the pose and shape prior losses, $L_\text{reg}$ and $L_\text{gmm}$. (e) shows the results of our full method.
  • ...and 9 more figures
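
The Figure 5 caption names a penetration term, $L_\text{penet}$, without defining it here. As an illustration only, the sketch below shows one common way such a term is built: evaluate a signed distance function (SDF) of one body at the surface points of another and penalize negative values, which indicate points lying inside the body. The sphere SDF, the function names, and the hinge form are assumptions, not the paper's definition.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside.

    The sphere is a stand-in for one person's body surface (assumption).
    """
    return np.linalg.norm(points - center, axis=-1) - radius

def penetration_penalty(sdf_values):
    """Sum of penetration depths; points outside the body contribute zero."""
    return float(np.sum(np.maximum(0.0, -sdf_values)))

# Toy usage: two surface points of person B tested against person A,
# where person A is modeled as a unit sphere at the origin.
points_b = np.array([[0.0, 0.0, 0.0],    # at the sphere center, depth 1
                     [2.0, 0.0, 0.0]])   # outside the sphere
sdf = sphere_sdf(points_b, center=np.zeros(3), radius=1.0)
penalty = penetration_penalty(sdf)
```

Because the penalty is zero whenever no point is inside the other body, a term of this kind does not push people apart once they are merely touching, which is why it can be paired with a contact term that pulls predicted contact regions together.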