Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption

Buzhen Huang, Chen Li, Chongyang Xu, Liang Pan, Yangang Wang, Gim Hee Lee

TL;DR

This work tackles the challenge of reconstructing closely interactive humans from monocular video by introducing a distribution-adaption framework that fuses proxemic knowledge, physical plausibility, and image guidance. A dual-branch discrete interaction prior via VQ-VAE encodes two-person interactions, while a diffusion-based adaptor refines initial per-person predictions under proxemics and contact constraints. The model blends an initial single-person pose predictor with a dual-branch diffusion process that leverages a penetration loss and projection gradients to enforce physical and visual consistency, achieving state-of-the-art results on Hi4D, 3DPW, and CHI3D for closely interactive scenarios. This approach advances monocular reconstruction by explicitly modeling social interaction patterns and physical contacts, with potential implications for motion capture, human-robot interaction, and social scene understanding.
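
The guidance idea in the summary above (a penetration loss plus projection gradients steering the estimate) can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: each person is reduced to a single 3D point with a minimum separation `min_dist`, the camera is a unit-focal pinhole, gradients are taken by finite differences, and all function names are hypothetical. The paper itself works with full body motions inside a diffusion process, which this toy does not reproduce.

```python
import math

def project(p, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) to 2D image coordinates."""
    x, y, z = p
    return (focal * x / z, focal * y / z)

def penetration_loss(a, b, min_dist):
    """Squared overlap depth; zero once the two points are min_dist apart."""
    return max(0.0, min_dist - math.dist(a, b)) ** 2

def projection_loss(p, target_2d):
    """Squared 2D error between the projection of p and an observed keypoint."""
    u, v = project(p)
    return (u - target_2d[0]) ** 2 + (v - target_2d[1]) ** 2

def total_loss(a, b, obs_a, obs_b, min_dist=0.5, w_pen=10.0):
    """Image term for both people plus a weighted physical term."""
    return (projection_loss(a, obs_a) + projection_loss(b, obs_b)
            + w_pen * penetration_loss(a, b, min_dist))

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def guide(a, b, obs_a, obs_b, lr=0.05, steps=200):
    """Plain gradient descent on the combined loss (stand-in for guided denoising)."""
    x = list(a) + list(b)
    f = lambda v: total_loss(v[:3], v[3:], obs_a, obs_b)
    for _ in range(steps):
        g = numeric_grad(f, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x[:3], x[3:]

# Hypothetical initial estimates that overlap, and hypothetical 2D observations.
init_a, init_b = [0.0, 0.0, 3.0], [0.1, 0.0, 3.0]
obs_a, obs_b = (-0.1, 0.0), (0.1, 0.0)
ref_a, ref_b = guide(init_a, init_b, obs_a, obs_b)
```

In this toy setting the physical term pushes the two overlapping points apart, while the projection term keeps each one consistent with its 2D observation.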

Abstract

Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration, but overlook the modeling of close interactions. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. The main challenge of this task comes from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. In view of this, we propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information. This is based on the observation that human interaction follows specific patterns described by social proxemics. Specifically, we first design a latent representation based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics- and physics-guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model with two branches, each representing one individual, so that the interaction can be modeled via cross-attention. With the learned VQ-VAE priors and physical constraints as additional information, our proposed approach is capable of estimating accurate poses that are also plausible in terms of proxemics and physics. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate that our method outperforms existing approaches. The code is available at https://github.com/boycehbz/HumanInteraction.
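
The discrete latent representation mentioned in the abstract rests on the standard VQ-VAE quantization step: each continuous encoder output is replaced by its nearest codebook entry. A minimal sketch of that lookup follows. The codebook values, latent vectors, and function names are hypothetical, and the encoder, decoder, and training procedure are omitted.

```python
import math

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (Euclidean distance)."""
    indices, quantized = [], []
    for z in latents:
        idx = min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized

def commitment_loss(latents, quantized):
    """Mean squared distance between encoder outputs and their assigned codes."""
    total = sum(math.dist(z, q) ** 2 for z, q in zip(latents, quantized))
    return total / len(latents)

# Hypothetical toy codebook with three 2-D codes and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [-0.2, 0.1], [-0.8, 0.7]]
indices, quantized = quantize(latents, codebook)
loss = commitment_loss(latents, quantized)
```

The returned indices form the discrete code sequence; a decoder would reconstruct motion from the corresponding codebook vectors.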

Paper Structure (28 sections, 15 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Our method reconstructs closely interactive humans with plausible body poses, natural proxemic relationships, and accurate physical contacts from single-view inputs. To address this challenging task, we formulate the reconstruction as a distribution adaption from the initial prediction (c). Compared to the existing solution, BUDDI [muller2023generative] (b), our method (d) is more robust to visual ambiguity.
  • Figure 2: Overview of our method. Given a monocular video with interactive humans (a), we first regress an initial distribution for each person (b); inter-person penetration is marked in red. To refine the distribution, we design a proxemics- and physics-guided diffusion model to achieve distribution adaption (c). Specifically, the motions drawn from the initial distributions are updated by a discrete interaction prior. The updated motions are then fed into a dual-branch diffusion model and denoised under the guidance of physics and image observations. The denoised motions are then used as the input for the next timestep. The adaption takes several diffusion timesteps and finally produces accurate results (d).
  • Figure 3: The prior has a dual-branch structure, with each branch representing the motion of one character. Each branch has a codebook learned by VQ-VAE, which models interactive behaviours. Apart from the codebooks, the two branches share the same weights, and they exchange information through the cross-attention module.
  • Figure 4: Qualitative comparison with BUDDI [muller2023generative] and BEV [sun2022putting]. Our method produces more accurate body poses and physical contacts.
  • Figure 5: Ablation study. The initial prediction is severely affected by visual ambiguity and cannot reconstruct natural interaction. The interaction prior updates the motions from the initial prediction with proxemic behaviours. Although the prior produces better interaction, it still suffers from inter-person penetration. With the distribution adaption, our method refines the results and reconstructs accurate interaction and physical contacts.
  • ...and 2 more figures
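
The dual-branch design described in the captions for Figures 2 and 3, where each branch represents one person and the two exchange information through cross-attention while sharing weights, can be sketched as follows. This is an illustrative single-head attention in plain Python under stated assumptions: the token values, the projection matrices `Wq`, `Wk`, `Wv`, the residual connection, and all function names are hypothetical and are not taken from the authors' code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_attention(query_tokens, context_tokens, Wq, Wk, Wv):
    """Single-head attention: one person's tokens attend to the other's."""
    Q = [matvec(Wq, t) for t in query_tokens]
    K = [matvec(Wk, t) for t in context_tokens]
    V = [matvec(Wv, t) for t in context_tokens]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def dual_branch_step(tokens_a, tokens_b, Wq, Wk, Wv):
    """Both branches reuse the same weights; each adds a residual from the other."""
    attn_a = cross_attention(tokens_a, tokens_b, Wq, Wk, Wv)
    attn_b = cross_attention(tokens_b, tokens_a, Wq, Wk, Wv)
    new_a = [[x + y for x, y in zip(t, a)] for t, a in zip(tokens_a, attn_a)]
    new_b = [[x + y for x, y in zip(t, b)] for t, b in zip(tokens_b, attn_b)]
    return new_a, new_b

# Hypothetical 2-D motion tokens for two people and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
person_a = [[1.0, 0.0], [0.5, 0.5]]
person_b = [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1]]
new_a, new_b = dual_branch_step(person_a, person_b, identity, identity, identity)
```

Because the weights are shared, swapping the two inputs simply swaps the two outputs, so neither person is treated as privileged.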