AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

Feichi Lu, Zijian Dong, Jie Song, Otmar Hilliges

TL;DR

AvatarPose estimates the 3D poses and shapes of multiple closely interacting people from sparse multi-view videos by learning a personalized implicit neural avatar for each person and using it as a strong prior. It learns textured avatars in canonical space using an accelerated neural radiance field and SMPL-based deformation, then refines poses by minimizing color and silhouette rendering losses together with a collision constraint. An alternating optimization scheme iteratively improves avatars and poses, so poses are optimized directly from rendering losses rather than from noisy 2D joint detections. Experiments on Hi4D, CHI3D, and related datasets demonstrate state-of-the-art performance in close-contact scenarios and robustness to occlusions and inter-person contact. The approach is practical for capturing 3D human interaction in real-world multi-view setups because it does not require dense 3D ground-truth data.

Abstract

Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.
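
The alternating scheme described in the abstract can be illustrated with a deliberately tiny toy problem. This is a minimal sketch under stated assumptions, not the paper's implementation: the "avatar" is reduced to a single brightness value, the "pose" to a 1D position, and the renderer to a Gaussian blob on an 11-pixel line. All names (`render`, `fit_appearance`, `refine_pose`, `alternate`) and all constants are hypothetical and chosen only for illustration.

```python
import math

NUM_PIXELS = 11   # toy image size (assumption)
WIDTH = 1.5       # toy blob width (assumption)

def render(brightness, position):
    """Render a 1D 'image': a Gaussian blob of given brightness at `position`."""
    return [brightness * math.exp(-((k - position) ** 2) / (2 * WIDTH ** 2))
            for k in range(NUM_PIXELS)]

def photometric_loss(image, target):
    """Sum of squared pixel differences (stand-in for a color rendering loss)."""
    return sum((r - o) ** 2 for r, o in zip(image, target))

def fit_appearance(target, position):
    """'Avatar' step: closed-form least-squares brightness, position held fixed."""
    basis = render(1.0, position)
    return sum(o * g for o, g in zip(target, basis)) / sum(g * g for g in basis)

def refine_pose(target, brightness, position, steps=5, lr=0.05):
    """'Pose' step: gradient descent on position, brightness held fixed."""
    for _ in range(steps):
        image = render(brightness, position)
        # d(loss)/d(position), using d(image[k])/d(position) = image[k] * (k - position) / WIDTH**2
        grad = sum(2.0 * (r - o) * r * (k - position) / WIDTH ** 2
                   for k, (r, o) in enumerate(zip(image, target)))
        position -= lr * grad
    return position

def alternate(target, position_init, rounds=100):
    """Alternate between the appearance update and the pose update."""
    position = position_init
    brightness = fit_appearance(target, position)
    for _ in range(rounds):
        brightness = fit_appearance(target, position)
        position = refine_pose(target, brightness, position)
    return brightness, position

# Observation generated with brightness 2.0 at position 5.0; the initial
# position estimate (4.0) plays the role of a noisy initial pose.
target = render(2.0, 5.0)
brightness, position = alternate(target, position_init=4.0)
final_loss = photometric_loss(render(brightness, position), target)
```

The point of the toy is only the control flow: each block of parameters is updated while the other is frozen, and the pose update is driven by a rendering loss rather than by detected keypoints.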

Paper Structure (31 sections, 12 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: We propose AvatarPose, a method for estimating the 3D poses and shapes of multiple closely interacting people from multi-view videos. To this end, we first reconstruct the avatar of each individual and leverage the learned personalized avatars as priors to refine poses via color and silhouette rendering loss. We alternate between avatar refinement and pose optimization to obtain the final pose estimation.
  • Figure 2: Method Overview: Our method consists of two modules. (a) Multi-Avatar Prior Learning: Given the input multi-view images and estimated poses $\mathbf{\Theta}^{(l)}$, we sample points $\mathbf{x}^{(l)}$ for each individual $l$ along the rays in posed space, warp these points into canonical space, and calculate their color $\mathbf{c}^{(l)}$ and density $\mathbf{\sigma}^{(l)}$ via the canonical appearance network $\mathbf{\bar{F}}_{\sigma_{f}}^{(l)}$. We leverage layered volume rendering [stnerf] to obtain the final pixel color and compare it with the original input image to optimize the parameters of the avatars. (b) Avatar-guided Pose Optimization: Given the learned avatar model $\mathbf{\bar{F}}_{\sigma_{f}}^{(l)}$ and initial poses $\mathbf{\Theta}^{(l)}$ of each individual $l$, we deform all of the avatars with an SMPL-based deformer and render them jointly via layered volume rendering. We compare the composite rendering with the input observation and minimize the RGB and mask rendering losses to optimize the poses. A collision loss is additionally introduced to avoid interpenetration. Finally, we alternate between the two modules to obtain the final result. For clarity, the parameters to be optimized are marked in red in each module.
  • Figure 3: Qualitative Comparison with SotA methods [mvpose, 4Dassociation, faster_voxelpose, direct_regression, graph] on Hi4D and CHI3D. We show two examples from the Hi4D and CHI3D datasets compared with Graph, MvP, Faster VoxelPose, MVPose, and 4DAssociation. For each example, we show 2D projections on two sampled views.
  • Figure 4: Qualitative Results of our method on Hi4D (a)(b), CHI3D (c)(d), MultiHuman Real-Cap (e), and Shelf (f). The left and middle columns in each sub-figure show the 2D projections of the estimated 3D skeletons and SMPL body meshes on two views. The right column in each sub-figure demonstrates skeletons and SMPL bodies in 3D scenes.
  • Figure 5: Comparison with SMPL Body Prior. When SMPL is fitted only to 2D observations, some joints in close contact, such as arms, are incorrectly estimated and can even cause intersections between body surfaces. In contrast, our personalized prior enables accurate estimation of poses.
  • ...and 2 more figures
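
The joint rendering of several avatars mentioned in the Figure 2 caption can be sketched with the standard volume rendering quadrature used by neural radiance fields: samples from all individuals on one camera ray are merged, sorted by depth, and alpha-composited front to back. This is a hedged illustration, not the paper's code; the function name `composite_layers`, the tuple layout `(depth, density, rgb, delta)`, and the example values are assumptions, and the paper's exact layered formulation may differ.

```python
import math

def composite_layers(samples):
    """Alpha-composite ray samples gathered from all avatars.

    samples: list of (depth, density, rgb, delta) tuples, where `delta` is the
    length of the ray segment represented by the sample (hypothetical layout).
    Returns (pixel_color, accumulated_opacity).
    """
    ordered = sorted(samples, key=lambda s: s[0])  # front to back along the ray
    transmittance = 1.0                            # fraction of light still unblocked
    color = [0.0, 0.0, 0.0]
    for _depth, density, rgb, delta in ordered:
        alpha = 1.0 - math.exp(-density * delta)   # opacity of this segment
        weight = transmittance * alpha
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color, 1.0 - transmittance

# Two hypothetical people on the same ray: a red surface at depth 2 in front
# of a green surface at depth 3, both with high density.
person_a = [(2.0, 50.0, (1.0, 0.0, 0.0), 0.5)]
person_b = [(3.0, 50.0, (0.0, 1.0, 0.0), 0.5)]
pixel, opacity = composite_layers(person_a + person_b)
empty_pixel, empty_opacity = composite_layers([])
```

In this example the nearer person occludes the farther one, so the pixel is essentially red. The accumulated opacity is the kind of quantity a rendered silhouette or mask can be compared against, while the composited color is what a color rendering loss would compare with the input image.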