FAGhead: Fully Animate Gaussian Head from Monocular Videos

Yixin Xuan; Xinyang Li; Gongxin Yao; Shiwei Zhou; Donghui Sun; Xiaoxin Chen; Yu Pan

TL;DR

FAGhead tackles monocular 3D head avatar reconstruction by decoupling identity and expression within a FLAME-based parametric head model and introducing a Point-based Learnable Representation Field (PLRF) of Gaussian points. A Transform Network deforms canonical PLRF geometry to frame-specific configurations, while alpha rendering with a dedicated loss enforces edge-accurate geometry, reducing artifacts on hair and shoulders. The PLRF densifies the facial representation by placing Gaussian points along triangle midlines with a learnable parameter $n \in [0,1]$, and adaptive density control enables dynamic refinement during training; an MLP $F_{\theta}$ predicts spatial residuals $\delta\mu_i$, $\delta s_i$, $\delta r_i$ conditioned on FLAME properties $\rho_i$. Extensive experiments on open datasets and captured data show state-of-the-art fidelity in reconstruction, robust novel-view synthesis, and realistic cross-identity reenactment, outperforming INSTA, GaussianAvatars, and FlashAvatar. This approach enables high-quality, controllable head avatars from monocular video with practical implications for VR, social communication, and digital human creation, while acknowledging limitations in oral cavity modeling and preprocessing sensitivity.
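
The midline placement described above can be illustrated with a small sketch. It assumes that a "midline" means the segment from a triangle vertex to the midpoint of the opposite edge, and that $n$ interpolates linearly along it; the summary does not state either detail, so both are assumptions. The names `lerp` and `midline_points` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def lerp(a: Vec3, b: Vec3, t: float) -> Vec3:
    """Linear interpolation: t=0 returns a, t=1 returns b."""
    return tuple((1.0 - t) * x + t * y for x, y in zip(a, b))


def midline_points(v0: Vec3, v1: Vec3, v2: Vec3, n: float) -> List[Vec3]:
    """Return one point per midline of the triangle (v0, v1, v2).

    Assumption: each midline runs from a vertex to the midpoint of the
    opposite edge, and n in [0, 1] is the position along that segment
    (n=0 is the vertex, n=1 is the edge midpoint).
    """
    if not 0.0 <= n <= 1.0:
        raise ValueError("n must lie in [0, 1]")
    verts = (v0, v1, v2)
    points = []
    for i in range(3):
        apex = verts[i]
        # Midpoint of the edge opposite to the current vertex.
        edge_mid = lerp(verts[(i + 1) % 3], verts[(i + 2) % 3], 0.5)
        points.append(lerp(apex, edge_mid, n))
    return points
```

In a training setup, `n` would be a learnable parameter kept inside $[0,1]$ (for example through a sigmoid), so the optimizer can slide each point along its midline instead of fixing it at initialization.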

Abstract

High-fidelity reconstruction of 3D human avatars has wide applications in virtual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicitly build on traditional 3D morphable meshes (3DMM) and optimize neutral 3D Gaussians to reconstruct avatars with complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively handle the edges of avatars, we introduce alpha rendering to supervise the alpha value of each pixel. Extensive experiments on open-source datasets and our own captured datasets demonstrate that our approach generates high-fidelity 3D head avatars and fully controls the expression and pose of the virtual avatars, outperforming existing works.
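
As a rough illustration of per-pixel alpha supervision, the sketch below accumulates alpha along one pixel ray using standard front-to-back compositing and compares the result with a foreground mask through an L1 penalty. The abstract does not give the exact loss, so the L1 form and the function names `accumulated_alpha` and `alpha_l1_loss` are assumptions made for illustration only.

```python
from typing import Sequence


def accumulated_alpha(alphas: Sequence[float]) -> float:
    """Accumulated opacity of one pixel from depth-sorted per-Gaussian alphas.

    Front-to-back compositing gives 1 - prod(1 - alpha_i), i.e. one minus
    the transmittance remaining after all splats on the ray.
    """
    transmittance = 1.0
    for a in alphas:
        transmittance *= 1.0 - a
    return 1.0 - transmittance


def alpha_l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between rendered alpha and a target mask.

    Assumption: the L1 form is chosen here for simplicity; the source does
    not specify which penalty is used.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

For example, `accumulated_alpha` applied to two splats with alpha 0.5 each yields 0.75; if the mask marks that pixel as fully foreground, the loss pushes the accumulated value toward 1, which is the kind of pressure that sharpens silhouette edges.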

Paper Structure (24 sections, 17 equations, 16 figures, 3 tables)

Introduction
Related Work
Scene Reconstruction and Novel View Synthesis
3D Parameter Head Model
3D Head Portrait Synthesis
Preliminary
Method
Data Preprocessing
Point-based Learnable Representation Field
Transform Network
Geometry Enhancement
Optimization Scheme
Experiments
Experiment Setup
Qualitative and Quantitative Comparison in Reconstruction
...and 9 more sections

Figures (16)

Figure 1: Given a monocular video, our proposed FAGhead approach generates high-fidelity avatars and the corresponding alpha maps. By leveraging the novel Point-based Learnable Representation Field, FAGhead ensures photorealistic reanimation and generalizes to novel expressions and head poses.
Figure 2: FAGhead overview. Before training, a Point-based Learnable Representation Field is established at timestep 0 as the canonical point-based field. The transform network takes the canonical global positions and the current FLAME parameters as input and outputs the deformation between canonical and transformed space. In addition, we introduce alpha rendering to eliminate geometry errors.
Figure 3: The pipeline of Point-based Learnable Representation Field initialization and growth. We allocate four Gaussian points to each triangle as initialization. During training, the positions of the Gaussian points are dynamically adjusted. Meanwhile, we adopt an adaptive density control and growth strategy, which adds and removes splats based on the view-space positional gradient and the opacity of each Gaussian point.
Figure 4: Capturing details of our dataset.
Figure 5: Alpha rendering results of FAGhead.
...and 11 more figures
