SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

Zhijing Shao; Zhaolong Wang; Zhuang Li; Duotun Wang; Xiangru Lin; Yu Zhang; Mingming Fan; Zeyu Wang

TL;DR

SplattingAvatar targets real-time, photorealistic human avatars learned from monocular video by coupling explicit mesh motion with implicit appearance modeling through Gaussian Splatting embedded on a triangle mesh. It introduces a trainable embedding $E=(k,u,v,d)$ that places each Gaussian at mean $\mu = P + d\,\boldsymbol{n}$, where $P$ is the barycentric point on the mesh and $\boldsymbol{n}$ is the interpolated normal, and uses lifted optimization to jointly refine Gaussian parameters and embeddings as the mesh deforms. The method renders in real time (over $300$ FPS on an NVIDIA RTX 3090 and about $30$ FPS on an iPhone 13) with state-of-the-art rendering quality on head and full-body datasets, and it is particularly strong on eyes, hair, and off-surface geometry. It emphasizes editability, compatibility with common animation pipelines, and portability, while acknowledging limitations in disentangling clothes and hair and proposing richer mesh representations as future work.
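The relation $\mu = P + d\,\boldsymbol{n}$ can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `embedding_to_mean`, the list-based mesh layout, and the reading of $k$ as a triangle index with $(u, v)$ as barycentric weights of the first two vertices are assumptions made for illustration only.

```python
import math

def embedding_to_mean(vertices, normals, faces, k, u, v, d):
    """Sketch: map an embedding (k, u, v, d) to a Gaussian mean mu = P + d * n.

    Assumptions (not confirmed by the source): k indexes a triangle in
    `faces`; u and v weight the first and second vertex of that triangle,
    and the third weight is 1 - u - v.
    """
    idx = faces[k]
    weights = (u, v, 1.0 - u - v)

    # Barycentric point P on the triangle.
    P = [sum(w * vertices[i][c] for w, i in zip(weights, idx)) for c in range(3)]

    # Phong-style normal: interpolate per-vertex normals, then renormalize.
    n = [sum(w * normals[i][c] for w, i in zip(weights, idx)) for c in range(3)]
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:
        raise ValueError("interpolated normal has zero length")
    n = [x / length for x in n]

    # Displace along the interpolated normal.
    return [P[c] + d * n[c] for c in range(3)]

# Hypothetical one-triangle mesh lying in the z = 0 plane.
vertices = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
normals = [[0.0, 0.0, 1.0]] * 3
faces = [(0, 1, 2)]

mu_canonical = embedding_to_mean(vertices, normals, faces, k=0, u=0.25, v=0.25, d=0.1)

# Moving the mesh moves the Gaussian; the embedding itself is unchanged.
posed = [[x, y, z + 2.0] for x, y, z in vertices]
mu_posed = embedding_to_mean(posed, normals, faces, k=0, u=0.25, v=0.25, d=0.1)
print(mu_canonical, mu_posed)
```

Because the mean is a function of the mesh vertices and normals, any technique that deforms the mesh (skeletal animation, blend shapes, manual editing) repositions the Gaussians without a separate learned deformation field. A positive $d$ pushes the Gaussian outside the surface and a negative $d$ pulls it inside.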

Abstract

We present SplattingAvatar, a hybrid 3D representation of photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh, which renders at over 300 FPS on a modern GPU and 30 FPS on a mobile device. We disentangle the motion and appearance of a virtual human with explicit mesh geometry and implicit appearance modeling with Gaussian Splatting. The Gaussians are defined by barycentric coordinates and displacement on a triangle mesh as Phong surfaces. We extend lifted optimization to simultaneously optimize the parameters of the Gaussians while walking on the triangle mesh. SplattingAvatar is a hybrid representation of virtual humans where the mesh represents low-frequency motion and surface deformation, while the Gaussians take over the high-frequency geometry and detailed appearance. Unlike existing deformation methods that rely on an MLP-based linear blend skinning (LBS) field for motion, we control the rotation and translation of the Gaussians directly by the mesh, which makes the representation compatible with various animation techniques, e.g., skeletal animation, blend shapes, and mesh editing. Trainable from monocular videos for both full-body and head avatars, SplattingAvatar shows state-of-the-art rendering quality across multiple datasets.

Paper Structure (12 sections, 15 equations, 14 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Method
Experiments
Datasets
Comparison with SoTA
Ablation Study
Discussion
Conclusion
Dataset
Implementation Details
Additional Results

Figures (14)

Figure 1: Overview of SplattingAvatar featuring Mesh-Embedded Gaussian Splatting. Our method takes (a) monocular videos as input, while employing (b) a trainable embedding technique for Gaussian-Mesh association. (c) Animated by mesh through the learned embedding, the Gaussians render into high-fidelity human avatars. (d) SplattingAvatar demonstrates real-time rendering capabilities in Unity, achieving over 300 FPS on an NVIDIA RTX 3090 GPU and 30 FPS on an iPhone 13 (images captured in action).
Figure 2: The pipeline of our method. SplattingAvatar learns 3D Gaussians with trainable embedding on the canonical mesh. The motion and deformation of the mesh explicitly bring the Gaussians to the posed space for differentiable rasterization. Both the Gaussians and embedding parameters are optimized during training. The position ${\mu}$ is the barycentric point $P$ plus a displacement $d$ along the interpolated normal vector $\boldsymbol{n}$. Pose-dependent quaternion and scaling $(\delta{{q}}, \delta{s})$ and pose-invariant quaternion, scaling, opacity, and color $(\overline{{q}}, \overline{s}, o, c)$ together define the properties of the Gaussians.
Figure 3: The development of Gaussian embeddings on the mesh. Each line segment indicates the position of one Gaussian displaced from its embedding point on the mesh. Gaussians for off-surface geometry such as hair have positive displacements, while the others tend to have negative displacements: where the mesh surface is correctly aligned to the geometry, as in the facial area, the means of the Gaussians lie inside the mesh.
Figure 4: Walking on triangles for embedding update. a) The recursion process of walking on a triangle mesh. b) The update $P + \delta$ starting from triangle CAB is re-expressed as $P' + \delta'$ in triangle DBA, and c) re-expressed again in EDA. The re-expression between two triangles is conducted by conceptually treating them as two right triangles adjacent to each other on the hypotenuse.
Figure 5: Qualitative comparison on head avatars. SplattingAvatar produces photorealistic rendering for avatars with high-quality details, especially in the eye and hair regions. Even the light reflection on the glasses is well reconstructed. Both PointAvatar and NHA can reconstruct good geometries, but the rendering quality is limited by their underlying representations, i.e., points and texture atlas respectively. Compared to INSTA, our trainable embedding scheme produces better quality for off-surface geometries, especially for the glasses. The green arrows highlight where our results have better consistency with Ground Truth, while the red arrows point to where other methods show significant artifacts or noise. Please see the supplemental materials for illustrations of the error map.
...and 9 more figures
