STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction

Jiankuo Zhao, Xiangyu Zhu, Zidu Wang, Zhen Lei

TL;DR

The paper tackles monocular 3D head avatar reconstruction by introducing UV-Adaptive Soft Binding, which enables non-rigid, texture-aware Gaussian deformation via UV-space feature offsets, and Temporal Adaptive Density Control, which uses FLAME-conditioned clustering and a fused perceptual error to densify Gaussians in transient regions. These components are integrated with a tailored training objective that balances geometry and texture and is optimized efficiently. Empirical results on four datasets show state-of-the-art detail recovery, better handling of occluded regions like mouth interiors and eyelids, and faster convergence. The work advances practical, high-fidelity, animated avatars from monocular video with robust cross-reenactment capabilities.
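
To make the clustering idea concrete, the following is a minimal sketch of grouping video frames by their per-frame parameter vectors. It assumes plain k-means over hypothetical FLAME expression/pose coefficients; the summary only says that frames are clustered under FLAME conditioning, so the algorithm, the feature choice, and the names `cluster_frames` and `frame_params` are illustrative assumptions rather than the authors' procedure.

```python
import math
import random


def cluster_frames(frame_params, k, iters=20, seed=0):
    """Group frames with k-means over per-frame parameter vectors.

    frame_params: list of equal-length float lists, one per video frame
                  (hypothetically, FLAME expression/pose coefficients).
    Returns a list of cluster indices, one per frame.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct frames (assumes k <= number of frames).
    picks = rng.sample(range(len(frame_params)), k)
    centroids = [list(frame_params[i]) for i in picks]
    labels = [0] * len(frame_params)
    for _ in range(iters):
        # Assignment step: each frame goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in frame_params
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(frame_params, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With such labels in hand, a densification statistic could be accumulated separately for each cluster, so that frames showing a rarely visible region (for example an open mouth) are not averaged together with the many frames in which that region is hidden.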

Abstract

Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they lack specialized strategies to handle frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image-based and geometric priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to shape and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces a novel fused perceptual error as clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.
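
As a rough illustration of soft binding, the sketch below samples an offset from a UV-space map at a Gaussian's UV coordinate and adds it to that Gaussian's coarse parameters, which the Figure 3 caption writes as $\theta^* = \tilde{\theta} + \delta_i$. Bilinear interpolation, the nested-list map layout, and the names `sample_offset` and `soft_bind` are assumptions made for illustration; the paper's actual sampling scheme and parameterization are not given in this summary.

```python
def sample_offset(offset_map, u, v):
    """Bilinearly sample a UV-space offset map at (u, v) in [0, 1]^2.

    offset_map: H x W x C nested lists (hypothetical network output).
    Returns a length-C list: the offset delta for one Gaussian.
    """
    h, w = len(offset_map), len(offset_map[0])
    # Map normalised UV to continuous pixel coordinates.
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(offset_map[0][0])
    return [
        (1 - fy) * ((1 - fx) * offset_map[y0][x0][c] + fx * offset_map[y0][x1][c])
        + fy * ((1 - fx) * offset_map[y1][x0][c] + fx * offset_map[y1][x1][c])
        for c in range(channels)
    ]


def soft_bind(coarse_params, offset_map, uv):
    """Return theta* = theta~ + delta for one Gaussian."""
    delta = sample_offset(offset_map, *uv)
    return [t + d for t, d in zip(coarse_params, delta)]
```

Because the offset is looked up by UV coordinate rather than stored per Gaussian, a newly cloned or split Gaussian can obtain its offset simply by sampling the same map at its own UV position, which is one way to read the abstract's remark that the UV representation stays compatible with Adaptive Density Control.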

Paper Structure

This paper contains 17 sections, 12 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: STAvatar proposes a Soft Binding framework and a Temporal Adaptive Density Control strategy to reconstruct high-fidelity 3D head avatars from monocular videos.
  • Figure 2: Limitations of existing research. (a) Hard binding forces Gaussians to remain relatively static within the triangle coordinate frames, thereby limiting their ability to capture fine-grained details. (b) Transiently visible regions, such as mouth interiors, often exhibit low average positional gradients, which impedes effective Gaussian densification. (c) The positional gradient only reflects geometric inconsistencies and often loses texture details, which hinders the addition of Gaussians in high-frequency regions.
  • Figure 3: Overview of STAvatar. (a) In addition to a fixed identity reference image and its UV position map, we further rasterize the vertex offsets between reference mesh and control mesh to obtain a UV displacement map as input. (b) We construct a dual-branch network to predict a feature offset map in UV space, from which an offset $\delta_i$ is sampled for each Gaussian $g_i$. This offset is added to the coarsely estimated parameters $\tilde{\theta}$ to get final parameters $\theta^*$. The final images are then rendered using Gaussian Splatting. (c) We first construct a perceptual error map by combining $\mathcal{L}_1$ map and $\mathcal{L}_\mathrm{d\text{-}ssim}$ map. Then, we estimate the 2D projection of each Gaussian $g_i$ using the recorded attributes, based on which the fused perceptual error is computed.
  • Figure 4: FLAME-Conditioned Temporal Clustering. We cluster video frames into $K$ clusters and conduct ADC within each cluster's training.
  • Figure 5: Qualitative results of head avatar reconstruction. Our method recovers finer details and delicate structures such as wrinkles and teeth. Moreover, it produces clearer results in challenging regions like mouth interiors and eyelids.
  • ...and 4 more figures
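
The fused perceptual error described in the Figure 3(c) caption can be sketched as follows: blend a per-pixel $\mathcal{L}_1$ map with a per-pixel D-SSIM map, then score each Gaussian by the error under its 2D projection. The mixing factor `weight`, the circular footprint, the mean aggregation, and the cloning `threshold` are all hypothetical choices made for this sketch; the caption states only that the two maps are combined and that the error is computed from each Gaussian's estimated projection.

```python
def fuse_error_maps(l1_map, dssim_map, weight=0.5):
    """Blend two H x W per-pixel error maps; `weight` is a hypothetical mixing factor."""
    return [
        [(1 - weight) * a + weight * b for a, b in zip(row_l1, row_ds)]
        for row_l1, row_ds in zip(l1_map, dssim_map)
    ]


def gaussian_error(error_map, cx, cy, radius):
    """Mean fused error over the pixels inside a Gaussian's projected disc."""
    h, w = len(error_map), len(error_map[0])
    total, count = 0.0, 0
    for y in range(max(0, int(cy - radius)), min(h, int(cy + radius) + 1)):
        for x in range(max(0, int(cx - radius)), min(w, int(cx + radius) + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += error_map[y][x]
                count += 1
    return total / count if count else 0.0


def select_for_cloning(error_map, footprints, threshold):
    """Indices of Gaussians whose fused error exceeds a (hypothetical) threshold.

    footprints: list of (cx, cy, radius) tuples in pixel units, one per Gaussian.
    """
    return [
        i for i, (cx, cy, r) in enumerate(footprints)
        if gaussian_error(error_map, cx, cy, r) > threshold
    ]
```

Unlike a criterion based only on positional gradients, a score of this kind also rises where the rendered texture is wrong even though the geometry is roughly right, which matches the limitation illustrated in Figure 2(c).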