Better Together: Unified Motion Capture and 3D Avatar Reconstruction

Arthur Moreau; Mohammed Brahimi; Richard Shaw; Athanasios Papaioannou; Thomas Tanay; Zhensong Zhang; Eduardo Pérez-Pellitero

TL;DR

Better Together proposes a unified, differentiable framework that jointly optimizes human pose estimation and photorealistic 3D avatar reconstruction from multi-view video. The method couples a personalized SMPL-X-based mesh with a layer of 3D Gaussians and time-dependent Motion MLPs to render and refine pose-consistent avatars, which provides dense, pixel-level supervision beyond sparse keypoints. Experiments on MOYO and DNA-Rendering show state-of-the-art pose accuracy, with error reduced by 35% on body joints and 45% on hand joints relative to keypoint-based methods on challenging yoga poses, and a gain of about 2 dB PSNR in novel-view avatar rendering. The work demonstrates the synergistic benefit of solving pose estimation and avatar rendering together, with practical implications for high-fidelity virtual humans and telepresence. It also identifies efficiency and clothing handling as areas for future improvement.

Abstract

We present Better Together, a method that solves the human pose estimation problem while simultaneously reconstructing a photorealistic 3D human avatar from multi-view videos. While prior art usually solves these problems separately, we argue that joint optimization of skeletal motion with a 3D renderable body model brings synergistic effects, i.e., it yields more precise motion capture and improved visual quality in real-time avatar rendering. To achieve this, we introduce a novel animatable avatar with 3D Gaussians rigged on a personalized mesh and propose to optimize the motion sequence with time-dependent MLPs that provide accurate and temporally consistent pose estimates. We first evaluate our method on highly challenging yoga poses and demonstrate state-of-the-art accuracy on multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared to keypoint-based methods. At the same time, our method significantly boosts the visual quality of animatable avatars (+2 dB PSNR on novel view synthesis) on diverse challenging subjects.
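
To make the idea of a time-dependent motion MLP concrete, the following is a minimal NumPy sketch of a network that maps a normalized frame time to per-joint pose parameters. It is an illustration only and not the authors' architecture: the sinusoidal time encoding, the layer sizes, the 55-joint axis-angle output, and the names `encode_time`, `init_motion_mlp`, and `motion_mlp` are all assumptions made for this example. Because every frame is predicted by the same smooth function of time, neighbouring frames receive similar poses, which is one plausible reading of why such a parameterization helps temporal consistency.

```python
import numpy as np

N_JOINTS = 55  # assumption: number of joints, each with a 3D axis-angle rotation
N_FREQS = 4    # assumption: number of sinusoidal frequencies in the time encoding


def encode_time(t, n_freqs=N_FREQS):
    """Sinusoidal encoding of normalized time t in [0, 1]; shape (T, 2 * n_freqs)."""
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))[:, None]
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=1)


def init_motion_mlp(hidden=64, seed=0):
    """Randomly initialize a two-layer MLP (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    d_in, d_out = 2 * N_FREQS, N_JOINTS * 3
    return {
        "W1": rng.normal(0.0, 0.1, (d_in, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def motion_mlp(params, t):
    """Map times of shape (T,) to pose parameters of shape (T, N_JOINTS, 3)."""
    h = np.tanh(encode_time(t) @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    return out.reshape(-1, N_JOINTS, 3)


# Query the (untrained) network on 100 evenly spaced frames.
params = init_motion_mlp()
times = np.linspace(0.0, 1.0, 100)
poses = motion_mlp(params, times)
```

In a joint optimization of the kind the abstract describes, the weights in `params` would be updated by gradients of an image-space loss computed on the rendered avatar rather than fitted to keypoints alone; that training loop is omitted here.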
