GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Ye Yuan, Xueting Li, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, Umar Iqbal

TL;DR

GAvatar generates animatable 3D avatars from text by combining Gaussian splatting with a primitive-based, pose-aware representation and an SDF-based mesh learning pathway. Gaussians are attached to pose-driven primitives, and neural implicit fields predict their attributes, which stabilizes the optimization of millions of Gaussians under high-variance losses such as score distillation sampling (SDS). An SDF-based pipeline regularizes the geometry and enables extraction of high-quality textured meshes. Key contributions are the primitive-based implicit Gaussian avatar, SDF-driven opacity and mesh extraction, and rendering at 100 fps at 1K resolution across diverse prompts. The resulting avatars are suited to immersive applications in AR/VR, gaming, and synthetic data generation.

Abstract

Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions, addressing the limitations (e.g., flexibility and efficiency) imposed by mesh or NeRF-based representations. However, a naive application of Gaussian splatting cannot generate high-quality animatable avatars and suffers from learning instability; it also cannot capture fine avatar geometries and often leads to degenerate body parts. To tackle these problems, we first propose a primitive-based 3D Gaussian representation where Gaussians are defined inside pose-driven primitives to facilitate animation. Second, to stabilize and amortize the learning of millions of Gaussians, we propose to use neural implicit fields to predict the Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries and extract detailed meshes, we propose a novel SDF-based implicit mesh learning approach for 3D Gaussians that regularizes the underlying geometries and extracts highly detailed textured meshes. Our proposed method, GAvatar, enables the large-scale generation of diverse animatable avatars using only text prompts. GAvatar significantly surpasses existing methods in terms of both appearance and geometry quality, and achieves extremely fast rendering (100 fps) at 1K resolution.
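
To make the idea of Gaussians "defined inside pose-driven primitives" concrete, the following is a minimal sketch of how a Gaussian position stored in a primitive's local frame could be mapped to world coordinates. It borrows the notation of the Figure 2 caption (primitive position $P_k$, rotation $R_k$, scaling $S_k$, local position $p_k^i$) and assumes a scale-rotate-translate composition, $\hat{p}_k^i = R_k (S_k \odot p_k^i) + P_k$; the exact form of the paper's equation is not reproduced on this page, so treat that composition as an assumption. The function names `rotation_z` and `local_to_world` are hypothetical.

```python
import math


def rotation_z(angle_rad):
    """Return a 3x3 rotation about the z-axis as nested lists (illustrative helper)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def local_to_world(p_local, P_k, R_k, S_k):
    """Map a Gaussian position from primitive-local to world coordinates.

    Assumed form (not confirmed by the source): p_world = R_k @ (S_k * p_local) + P_k
    """
    # Per-axis scaling in the primitive's local frame.
    scaled = [S_k[j] * p_local[j] for j in range(3)]
    # Rotate into the world orientation of the primitive.
    rotated = [sum(R_k[r][c] * scaled[c] for c in range(3)) for r in range(3)]
    # Translate by the primitive's world position.
    return [rotated[r] + P_k[r] for r in range(3)]


# The same local Gaussian under a rest-pose primitive and a re-posed primitive.
p_local = [1.0, 0.0, 0.0]
rest = local_to_world(p_local, P_k=[0.0, 0.0, 0.0],
                      R_k=rotation_z(0.0), S_k=[0.5, 0.5, 0.5])
posed = local_to_world(p_local, P_k=[1.0, 2.0, 0.0],
                       R_k=rotation_z(math.pi / 2), S_k=[0.5, 0.5, 0.5])
```

Because the local position never changes, animating the avatar in this sketch only requires recomputing the primitive transforms for the target pose, which is the property the abstract attributes to the primitive-based representation.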

Paper Structure

This paper contains 19 sections, 10 equations, 15 figures, and 2 tables.

Figures (15)

  • Figure 1: GAvatar synthesizes high-fidelity 3D animatable avatars from text prompts. Our novel primitive-based implicit Gaussian representation enables efficient avatar animation (100 fps, 1K resolution) and also extracts a highly detailed mesh from learned 3D Gaussians.
  • Figure 2: Overview of GAvatar. We first generate the primitives $V_k{=}(P_k, R_k, S_k)$ in the rest pose $\tilde{\theta}$. Each primitive consists of $N_k$ 3D Gaussians with their position $p_k^i$, rotation $r_k^i$ and scaling $s_k^i$ defined in the primitive's local coordinate system. Next, we obtain the canonical positions, $\hat{p}^i_k(\tilde{\theta})$, of the Gaussians in the world coordinates by applying the global transforms of the primitives using Eq. \ref{eq:gaussian}. These positions are then used to query the color $c_k^i$, rotation $r_k^i$ and scaling $s_k^i$ of each Gaussian from a neural attribute field $\mathcal{H}_\phi$. Each Gaussian's SDF value is queried from a neural SDF $\mathcal{S}_\psi$ and is converted into the opacity $\sigma_k^i$ through a kernel function $\mathcal{K}$. The 3D Gaussians with the predicted attributes are then rasterized onto the camera view using Gaussian splatting to produce the RGB image $I$ and alpha image $I_\alpha$. We use DMTet (Shen et al., 2021) to differentiably extract the mesh from the Gaussian SDF values and generate its normal map and silhouette for geometry regularization. To animate the avatar with any target pose $\theta$, we generate the primitives using the target pose and use them to transform the 3D Gaussians before rasterizing the image. A method walkthrough is also provided in the supplementary video: https://youtu.be/PbCF1HzrKrs.
  • Figure 3: Avatars generated by our method, with their mesh normals and textured meshes.
  • Figure 4: Comparison with state-of-the-art methods. From top to bottom, the prompts used in each row are "a person dressed at the venice carnival", "a professional boxer" and "a bedouin dressed in white". Our method consistently produces the best quality avatars.
  • Figure 5: Ablation Studies.
  • ...and 10 more figures
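
The Figure 2 caption states that each Gaussian's SDF value is converted into an opacity through a kernel function $\mathcal{K}$, but the kernel itself is not given on this page. As a hedged illustration only, the sketch below uses a bell-shaped logistic-density kernel, $\mathcal{K}(x) = \gamma\, e^{-\lambda x} / (1 + e^{-\lambda x})^2$, which peaks on the surface ($x = 0$) and decays with distance from it. The kernel choice, the function name `sdf_to_opacity`, and the parameter names and default values (`sharpness` for $\lambda$, `gain` for $\gamma$) are assumptions, not the authors' confirmed implementation.

```python
import math


def sdf_to_opacity(sdf, sharpness=50.0, gain=4.0):
    """Convert a signed distance into an opacity with a bell-shaped kernel.

    Assumed kernel (hypothetical): K(x) = gain * e^(-s*x) / (1 + e^(-s*x))^2.
    The logistic density is symmetric in x, so abs(sdf) gives the same value
    while keeping the exponent non-positive and avoiding overflow.
    """
    z = math.exp(-sharpness * abs(sdf))
    return gain * z / (1.0 + z) ** 2


# Opacity is highest for Gaussians on the surface and falls off on both sides.
on_surface = sdf_to_opacity(0.0)
slightly_outside = sdf_to_opacity(0.02)
slightly_inside = sdf_to_opacity(-0.02)
far_away = sdf_to_opacity(0.5)
```

With `gain=4.0` the peak opacity is exactly 1 at the zero level set. Tying opacity to the SDF in this way means Gaussians far from the surface become transparent, which is consistent with the caption's description of the SDF regularizing the underlying geometry.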