Better Together: Unified Motion Capture and 3D Avatar Reconstruction

Arthur Moreau, Mohammed Brahimi, Richard Shaw, Athanasios Papaioannou, Thomas Tanay, Zhensong Zhang, Eduardo Pérez-Pellitero

TL;DR

Better Together proposes a unified, differentiable framework that jointly optimizes human pose estimation and photorealistic 3D avatar reconstruction from multi-view video. The method couples a personalized SMPL-X-based mesh with a 3D Gaussian layer and time-dependent Motion MLPs to render and refine pose-consistent avatars, which enables dense, pixel-level supervision beyond keypoints. Experiments on MOYO and DNA-Rendering show state-of-the-art pose accuracy, particularly for body and hands, and notable improvements in novel-view avatar rendering quality. The work demonstrates the synergistic benefit of solving pose estimation and avatar rendering together, with practical implications for high-fidelity virtual humans and telepresence. It also identifies efficiency and clothing handling as areas for future improvement.

Abstract

We present Better Together, a method that simultaneously solves the human pose estimation problem while reconstructing a photorealistic 3D human avatar from multi-view videos. While prior art usually solves these problems separately, we argue that joint optimization of skeletal motion with a 3D renderable body model brings synergistic effects, i.e. yields more precise motion capture and improved visual quality of real-time rendering of avatars. To achieve this, we introduce a novel animatable avatar with 3D Gaussians rigged on a personalized mesh and propose to optimize the motion sequence with time-dependent MLPs that provide accurate and temporally consistent pose estimates. We first evaluate our method on highly challenging yoga poses and demonstrate state-of-the-art accuracy on multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared to keypoint-based methods. At the same time, our method significantly boosts the visual quality of animatable avatars (+2dB PSNR on novel view synthesis) on diverse challenging subjects.
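To put the reported +2 dB PSNR gain in perspective, the sketch below computes PSNR from its standard definition and converts a gain in dB into the implied ratio of mean squared errors. The function names `psnr` and `mse_ratio_for_psnr_gain` are hypothetical helpers introduced here for illustration, and images are assumed to be scaled to [0, 1]; this is not the authors' evaluation code.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumes pixel values lie in [0, max_value] (an assumption of this sketch).
    """
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR improvement of gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_for_psnr_gain(2.0)
    print(f"A +2 dB gain corresponds to an MSE ratio of {ratio:.3f}")
```

Because PSNR is logarithmic in the mean squared error, a fixed gain in dB always corresponds to the same relative reduction in MSE, regardless of the starting quality.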

Paper Structure

This paper contains 49 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Our method iteratively optimizes human poses with an animatable avatar to reconstruct images. We observe that photometric supervision enables learning not only photorealistic avatars, but also highly accurate human motion.
  • Figure 2: Overview of the forward deformation process of our method. Given a timestep, Motion MLPs output pose parameters, which are used to compute skeleton joint transformations. Then, we compute a pose-dependent personalization of the template mesh, which we deform with LBS. 3D Gaussians are then attached to triangles and rasterized into an image, which is compared to the captured frame.
  • Figure 3: Qualitative pose estimation results on MOYO. Comparison of mesh reprojections into images. All methods use Sapiens [khirodkar2025sapiens] keypoints. Meshes are rendered with Steiner Gaussians (see the Gaussians layer section), except ScoreHMR. A video comparison is provided in the supplementary material.
  • Figure 4: Novel view synthesis on DNA-Rendering. We compare rendering quality for avatars trained with poses obtained from our method and with ground-truth poses from DNA-Rendering. Our parameters produce sharper renderings with details closer to the ground truth.
  • Figure 5: Integration of our avatar in an external Gaussian splatting scene from Mip-NeRF360 [barron2022mipnerf360].
  • ...and 1 more figures
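
The forward deformation described in the Figure 2 caption can be illustrated with a minimal NumPy sketch of two of its steps: linear blend skinning (LBS) of mesh vertices, followed by placing one Gaussian center per triangle. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the array shapes, and the barycentric placement of Gaussian centers are hypothetical, and the Motion MLPs, the pose-dependent personalization, and the rasterizer are omitted.

```python
import numpy as np


def linear_blend_skinning(vertices, weights, joint_transforms):
    """Deform rest-pose vertices with per-joint rigid transforms.

    vertices:         (V, 3) rest-pose positions
    weights:          (V, J) skinning weights, each row summing to 1
    joint_transforms: (J, 4, 4) homogeneous joint transformations
    """
    ones = np.ones((vertices.shape[0], 1))
    homogeneous = np.concatenate([vertices, ones], axis=1)  # (V, 4)
    # Blend the joint transforms per vertex, then apply them.
    blended = np.einsum("vj,jab->vab", weights, joint_transforms)  # (V, 4, 4)
    posed = np.einsum("vab,vb->va", blended, homogeneous)  # (V, 4)
    return posed[:, :3]


def gaussian_centers_on_triangles(vertices, faces, barycentric):
    """Place one Gaussian center per triangle (hypothetical parameterization).

    faces:       (F, 3) integer vertex indices
    barycentric: (F, 3) barycentric coordinates, each row summing to 1
    """
    triangles = vertices[faces]  # (F, 3, 3)
    return np.einsum("fk,fkc->fc", barycentric, triangles)


if __name__ == "__main__":
    rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    skin = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
    transforms = np.stack([np.eye(4), np.eye(4)])
    transforms[1, :3, 3] = [0.0, 0.0, 2.0]  # joint 1 translates along z

    posed = linear_blend_skinning(rest, skin, transforms)
    centers = gaussian_centers_on_triangles(
        posed, np.array([[0, 1, 2]]), np.full((1, 3), 1.0 / 3.0)
    )
    print(posed)
    print(centers)
```

Because the Gaussian centers are computed from the posed vertices, they follow the mesh as the pose changes, and every step is differentiable, so an image-space loss could in principle propagate gradients back to the pose parameters.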