SEGA: Drivable 3D Gaussian Head Avatar from a Single Image

Chen Guo, Zhuo Su, Jian Wang, Shuang Li, Xu Chang, Zhaohu Li, Yang Zhao, Guidong Wang, Ruqi Huang

TL;DR

SEGA tackles one-shot drivable 3D head avatars by fusing 2D identity priors with 3D priors in a hierarchical UV-space Gaussian Splatting framework. It disentangles identity and expression through a 2D VQ-VAE identity code and a displacement VAE, with static and dynamic Gaussian decoders operating on a FLAME-deformed UV map to enable real-time animation. The method is trained in two stages followed by a personalization step. On NeRSemble it shows better generalization and fidelity than state-of-the-art methods, and it supports cross-identity reenactment across data sources. This makes one-shot avatar creation more practical for VR and telepresence, with cross-view consistency and expressive realism; handling of accessories, hair, and lighting remains a limitation left to future work.

Abstract

Creating photorealistic 3D head avatars from limited input has become increasingly important for applications in virtual reality, telepresence, and digital entertainment. While recent advances like neural rendering and 3D Gaussian splatting have enabled high-quality digital human avatar creation and animation, most methods rely on multiple images or multi-view inputs, limiting their practicality for real-world use. In this paper, we propose SEGA, a novel approach for Single-imagE-based 3D drivable Gaussian head Avatar creation that combines generalized prior models with a new hierarchical UV-space Gaussian Splatting framework. SEGA seamlessly combines priors derived from large-scale 2D datasets with 3D priors learned from multi-view, multi-expression, and multi-ID data, achieving robust generalization to unseen identities while ensuring 3D consistency across novel viewpoints and expressions. We further present a hierarchical UV-space Gaussian Splatting framework that leverages FLAME-based structural priors and employs a dual-branch architecture to disentangle dynamic and static facial components effectively. The dynamic branch encodes expression-driven fine details, while the static branch focuses on expression-invariant regions, enabling efficient parameter inference and precomputation. This design maximizes the utility of limited 3D data and achieves real-time performance for animation and rendering. Additionally, SEGA performs person-specific fine-tuning to further enhance the fidelity and realism of the generated avatars. Experiments show our method outperforms state-of-the-art approaches in generalization ability, identity preservation, and expression realism, advancing one-shot avatar creation for practical applications.

Paper Structure

This paper contains 24 sections, 5 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We introduce SEGA, a novel approach for reconstructing photorealistic 3D Gaussian splats of a human head from a single image. Once the avatar is generated, SEGA enables free-viewpoint rendering, as well as self and cross-identity reenactment in real-time.
  • Figure 2: Overview of the SEGA method. Our network consists of an identity encoder, a deformation VAE, and hierarchical Gaussian decoders. The identity encoder extracts identity features from a single RGB image into the VQ-VAE latent space (\ref{sec:identity}). The deformation VAE regresses the displacement map for the FLAME mesh and extracts the expression feature as its latent vector (\ref{sec:deformation}). The static and dynamic Gaussian decoders transform identity and expression features into 3D Gaussian parameter maps, which are bound to the deformed mesh surface (\ref{sec:gaussian_prediction}). During training, we freeze the identity encoder, which is pre-trained on 2D data, and train the two Gaussian decoders sequentially, together with the deformation VAE, on the 3D dataset (\ref{sec:training}). For avatar creation, we fine-tune the decoders on the input image and precompute the static Gaussian parameter map for expression-independent regions. During real-time animation, given an expression sequence, we run only the VAE for deformation maps and the dynamic decoder for facial-region Gaussian maps. These parameters are fused with the static map to render the final head avatar efficiently and accurately.
  • Figure 3: Qualitative comparison on the NeRSemble dataset. SEGA demonstrates superior facial expression/pose fidelity and identity preservation compared to SOTA methods, including GOHA [li2024generalizable], GPAvatar [chu2024gpavatar], VOODOO3D [tran2023voodoo], PT4Dv1 [deng2024portrait4d], PT4Dv2 [deng2024portrait4dv2], and GAGAvatar [chu2024generalizable].
  • Figure 4: Cross-identity reenactment across data sources. Our method transfers expressions while preserving identity across NeRSemble (cols 1–4), studio-captured (cols 5–6), and in-the-wild data (cols 7–8), demonstrating strong disentanglement and generalization.
  • Figure 5: Ablation study evaluating different configurations. This includes the removal of individual loss terms such as the perceptual VGG loss and the ID loss, architectural variants such as using a fully static or fully dynamic model, as well as training strategies such as excluding the 2D prior or skipping the fine-tuning stage.
  • ...and 4 more figures
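
The animation path described in the Figure 2 caption (precompute the static Gaussian parameter map once, run only the dynamic decoder per frame, then fuse the two) can be illustrated with a minimal Python sketch. The source does not specify how the fusion is done, so the hard facial-region mask used here is an assumption, as are all names (`fuse_maps`, `AvatarRuntime`, `toy_decoder`), the flattened map layout, and the two-value texel format. The deformation VAE and the actual splatting step are omitted.

```python
from typing import Callable, List, Sequence

Texel = List[float]      # parameters of one Gaussian (layout is hypothetical)
ParamMap = List[Texel]   # flattened UV map, one entry per texel


def fuse_maps(static_map: ParamMap, dynamic_map: ParamMap,
              face_mask: Sequence[bool]) -> ParamMap:
    """Take dynamic parameters inside the facial region, static ones elsewhere.

    A hard mask is an assumption; the paper's fusion rule may differ.
    """
    if not (len(static_map) == len(dynamic_map) == len(face_mask)):
        raise ValueError("maps and mask must cover the same texels")
    return [list(d) if m else list(s)
            for s, d, m in zip(static_map, dynamic_map, face_mask)]


class AvatarRuntime:
    """Caches the expression-independent map; decodes only the dynamic map per frame."""

    def __init__(self, static_map: ParamMap, face_mask: Sequence[bool],
                 dynamic_decoder: Callable[[Sequence[float]], ParamMap]):
        self.static_map = static_map          # precomputed once per avatar
        self.face_mask = face_mask            # True where expression matters
        self.dynamic_decoder = dynamic_decoder

    def frame(self, expression_code: Sequence[float]) -> ParamMap:
        # Per-frame work: one dynamic decode plus a cheap per-texel select.
        dynamic_map = self.dynamic_decoder(expression_code)
        return fuse_maps(self.static_map, dynamic_map, self.face_mask)


def toy_decoder(expression_code: Sequence[float]) -> ParamMap:
    """Stand-in for the dynamic decoder: one texel per code entry."""
    return [[x, x] for x in expression_code]


runtime = AvatarRuntime(
    static_map=[[0.0, 0.0] for _ in range(4)],
    face_mask=[True, True, False, False],
    dynamic_decoder=toy_decoder,
)
fused = runtime.frame([1.0, 2.0, 3.0, 4.0])
```

In this toy setup the first two texels follow the expression code and the last two keep their cached static values, which is the property that lets the static branch be evaluated only once per avatar.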