AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos
Feichi Lu, Zijian Dong, Jie Song, Otmar Hilliges
TL;DR
AvatarPose estimates the 3D poses and shapes of multiple closely interacting people from sparse multi-view videos by learning a personalized implicit neural avatar for each person and using it as a strong prior. The textured avatars are learned in canonical space with an accelerated neural radiance field and SMPL-based deformation, and poses are refined by minimizing color and silhouette rendering losses under a collision constraint. Avatars and poses are improved iteratively in an alternating optimization scheme, so poses are optimized directly from rendering losses rather than from noisy 2D joint detections. Experiments on Hi4D, CHI3D, and related datasets show state-of-the-art performance in close-contact scenarios and robustness to occlusions and inter-person contact. Because supervision comes from rendering, the approach does not require dense 3D ground-truth data, which makes it practical for capturing 3D human interaction in real-world multi-view setups.
Abstract
Despite progress in human motion capture, existing multi-view methods often struggle to estimate the 3D pose and shape of multiple closely interacting people. This difficulty arises from their reliance on accurate 2D joint estimates, which are hard to obtain under the occlusions and body contact that occur in close interaction. To address this, we propose a novel method that leverages the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering losses, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars, which adds penetration constraints. Moreover, 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.
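The idea of penalizing the overlapping shape regions of two avatars can be illustrated with a small sketch. This is not the paper's formulation, which is not given in this section. It assumes each avatar is replaced by a soft sphere occupancy, and it takes the penalty to be the mean product of the two occupancies over random sample points. The names `sphere_occupancy`, `collision_penalty`, and `penalty_for_offset`, as well as the radius and sharpness values, are hypothetical.

```python
import numpy as np

def sphere_occupancy(points, center, radius, sharpness=50.0):
    """Soft occupancy in [0, 1]: close to 1 inside the sphere, close to 0 outside.

    Assumption: a sphere stands in for a learned avatar shape.
    """
    dist = np.linalg.norm(points - center, axis=-1)
    return 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))

def collision_penalty(points, occ_a, occ_b):
    """Mean product of two occupancies: large only where both shapes overlap."""
    return float(np.mean(occ_a(points) * occ_b(points)))

# Sample points uniformly in a cube that contains both stand-in shapes.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(20000, 3))

def penalty_for_offset(offset):
    """Penalty for two spheres of radius 0.3 whose centers are `offset` apart."""
    center_a = np.array([-offset / 2.0, 0.0, 0.0])
    center_b = np.array([offset / 2.0, 0.0, 0.0])
    occ_a = lambda p: sphere_occupancy(p, center_a, 0.3)
    occ_b = lambda p: sphere_occupancy(p, center_b, 0.3)
    return collision_penalty(points, occ_a, occ_b)

overlapping = penalty_for_offset(0.2)  # spheres interpenetrate
separated = penalty_for_offset(1.0)    # spheres are clearly apart
print(f"overlapping: {overlapping:.6f}, separated: {separated:.2e}")
```

In this toy setting the penalty is near zero for separated shapes and grows as they interpenetrate, so adding it to the color and silhouette terms would push optimized poses away from configurations where the two bodies overlap.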
