Expressive Gaussian Human Avatars from Monocular RGB Video

Hezhen Hu; Zhiwen Fan; Tianhao Wu; Yihan Xi; Seoyoung Lee; Georgios Pavlakos; Zhangyang Wang

TL;DR

This work tackles the challenge of building expressive human avatars from monocular RGB video, with a focus on fine-grained hand and facial detail. It presents EVA, which combines 3D Gaussian Splatting with the SMPL-X parametric model and introduces three components: a plug-and-play SMPL-X alignment module, a context-aware adaptive density control mechanism, and a confidence-guided per-pixel loss that supervises the Gaussian optimization. The method reports state-of-the-art results on both a controlled benchmark (XHumans) and an in-the-wild benchmark (UPB), with significant gains in LPIPS, particularly for hand and face regions, and remains expressive under novel poses. These contributions enable realistic, pose-controllable avatars from a single RGB video, with potential impact on VR/AR, film, and interactive media. The authors acknowledge limitations related to clothing and hair, as well as the potential for misuse in synthetic content.

Abstract

Nuanced expressiveness, particularly through fine-grained hand and facial expressions, is pivotal for enhancing the realism and vitality of digital human representations. In this work, we investigate the expressiveness of human avatars learned from monocular RGB video, a setting that introduces new challenges in capturing and animating fine-grained details. To this end, we introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X, an expressive parametric human model. Focused on enhancing expressiveness, our work makes three key contributions. First, we highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning. Recognizing the limitations of current SMPL-X prediction methods for in-the-wild videos, we introduce a plug-and-play module that significantly ameliorates misalignment issues. Second, we propose a context-aware adaptive density control strategy, which adaptively adjusts the gradient thresholds to accommodate the varied granularity across body parts. Third, we develop a feedback mechanism that predicts per-pixel confidence to better guide the learning of 3D Gaussians. Extensive experiments on two benchmarks demonstrate the superiority of our framework both quantitatively and qualitatively, especially on fine-grained hand and facial details. See the project website at https://evahuman.github.io.
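The second and third contributions can be illustrated with a minimal sketch. It is not the authors' implementation: the part labels, the threshold values in `PART_THRESHOLDS`, the function names, and the `-log(confidence)` regularizer are all assumptions chosen for illustration, since the abstract gives no formulas. The sketch only shows the general idea of a densification test whose gradient threshold depends on the body part, and of a photometric loss that is down-weighted where predicted confidence is low.

```python
import math

# Hypothetical per-part gradient thresholds (assumed values, not from the paper).
# Smaller thresholds make fine-grained parts such as hands densify more readily.
PART_THRESHOLDS = {"body": 2e-4, "face": 1e-4, "hand": 5e-5}


def densify_mask(grad_norms, part_labels, thresholds=PART_THRESHOLDS):
    """Flag Gaussians whose accumulated gradient norm exceeds the
    threshold assigned to the body part they belong to."""
    return [g > thresholds[p] for g, p in zip(grad_norms, part_labels)]


def confidence_weighted_l1(rendered, target, confidence, reg_weight=0.01):
    """L1 loss over flattened pixels, weighted by per-pixel confidence in (0, 1].

    The -log(confidence) term (an assumed, commonly used regularizer) penalizes
    the trivial solution of predicting low confidence everywhere.
    """
    n = len(rendered)
    data_term = sum(c * abs(r - t) for r, t, c in zip(rendered, target, confidence)) / n
    reg_term = -sum(math.log(c) for c in confidence) / n
    return data_term + reg_weight * reg_term


# Usage: the same gradient magnitude triggers densification for the face and
# hand Gaussians but not for the body Gaussian.
mask = densify_mask([1.5e-4, 1.5e-4, 1.5e-4], ["body", "face", "hand"])
loss = confidence_weighted_l1([0.0, 0.0], [1.0, 1.0], [1.0, 0.5])
```

In this sketch, lowering the threshold for a part is the only lever; how EVA actually adapts the thresholds and how the confidence map is predicted are described in the paper's Technical Approach section, not here.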

Paper Structure (15 sections, 15 equations, 4 figures, 3 tables)


Introduction
Related Work
Human Avatar Modeling from Monocular RGB Video
Expressive Human Representations
Technical Approach
Preliminaries
SMPL-X Alignment for Real-World Video
Context-aware Adaptive Density Control
Objective Functions
Experiments
Experimental Setup
Comparison with Baselines
Ablation Studies
Limitations and Broader Impact
Conclusion

Figures (4)

Figure 1: Qualitative comparison on novel pose synthesis between the SOTA method GART [lei2023gart] and our EVA method. Given a monocular real-world video, our EVA framework generates an expressive human avatar that outperforms the previous SOTA method, especially in hand and facial details.
Figure 2: Overview of the proposed EVA framework. Given a real-world monocular RGB video, EVA first prepares a well-aligned SMPL-X mesh via a plug-and-play module. EVA then uses 3D Gaussian Splatting for avatar modeling, incorporating the prior from the SMPL-X model. To improve the optimization and the expressiveness of the avatar, we propose context-aware adaptive density control and a confidence-aware loss.
Figure 3: Qualitative comparison with baselines, including 3DGS [kerbl20233d] + SMPLX, GART [lei2023gart] + SMPLX, and GauHuman [GauHuman] + SMPLX, on the XHumans and UPB datasets. Our EVA model exhibits the best visual quality. See the zoomed-in boxes for a comparison of fine-grained details.
Figure 4: Demonstration of the effectiveness of our SMPL-X alignment module. It produces an SMPL-X mesh that aligns well with the RGB frame, especially in the fine-grained hand regions.
