FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures

Yujie Sun, Zhuoqiang Cai, Chaoyue Niu, Jianchuan Chen, Zhiwen Chen, Chengfei Lv, Fan Wu

Abstract

We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple the two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone that learns geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from a few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar reconstructs new identities from only a few observations within minutes at state-of-the-art quality, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.

Paper Structure (52 sections, 17 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: FHAvatar generates high-fidelity, animatable 3D head avatars from arbitrary inputs, such as a few phone shots, in just minutes. The reconstructed heads contain composable face and hair components, enabling hairstyle transfer and convenient stylized editing.
  • Figure 2: Pipeline Overview. FHAvatar reconstructs a compositional face-and-hair 3D Gaussian head in UV space. Our model starts by encoding image, hair, and face tokens from arbitrary input images and a template head mesh (\ref{sec:encoder}); these tokens are fed into the aggregated transformer backbone, which performs attention-based multi-view feature aggregation (\ref{sec:transformer}). The dual-branch decoders then independently decode planar Gaussians for the face and strand-based Gaussians for the hair at UV pixels, which are combined for real-time rendering under novel views and expressions (\ref{sec:dual-branch-decoder}).
  • Figure 3: Qualitative Comparison on reconstructing unseen identities from both in-the-wild data and the NeRSemble dataset under different capture conditions. LAM struggles to preserve identity similarity, while GAGAvatar and DiffusionRig fail to maintain accurate control under novel expressions or viewpoints. Optimization-based methods such as GaussianAvatars, FlashAvatar, and MeGA often fail to fit under sparse inputs. In contrast, our method delivers high rendering quality, supports accurate expression reenactment, and maintains consistent identity.
  • Figure 4: Ablation Study. Best viewed with zoom-in.
  • Figure 5: Reconstruction quality improves with additional frames, which help the model capture finer details. Note that we run refinement for the same number of epochs regardless of the number of input frames.
  • ...and 7 more figures
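
The data flow described in the Figure 2 caption (per-view image tokens, attention-based aggregation into face and hair tokens, then two independent decoders) can be illustrated with a toy sketch. This is not the authors' implementation: the token dimension, the single-head attention, and the helper names `attend`, `aggregate`, `decode_face`, and `decode_hair` are all assumptions made for illustration, and the "decoders" only reshape features rather than predict real Gaussian parameters.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def aggregate(uv_tokens, view_tokens):
    """Let each UV token attend over the image tokens of all views.

    Flattening the views into one pool is what lets the number of
    input views vary (an assumption about how this could be done).
    """
    pool = [tok for view in view_tokens for tok in view]
    return [attend(q, pool, pool) for q in uv_tokens]


def decode_face(feature):
    """Hypothetical face branch: one planar Gaussian per UV pixel."""
    return {"kind": "planar", "params": feature[:3]}


def decode_hair(feature, points_per_strand=4):
    """Hypothetical hair branch: a strand as a polyline grown along one step vector."""
    step = feature[:3]
    points = [[i * s for s in step] for i in range(points_per_strand)]
    return {"kind": "strand", "points": points}


random.seed(0)
DIM = 4  # toy token dimension (assumption)


def rand_tokens(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


views = [rand_tokens(5) for _ in range(3)]  # 3 input views, 5 image tokens each
face_tokens = rand_tokens(2)                # stand-ins for face UV tokens
hair_tokens = rand_tokens(2)                # stand-ins for hair UV tokens

face_feats = aggregate(face_tokens, views)
hair_feats = aggregate(hair_tokens, views)

face_gaussians = [decode_face(f) for f in face_feats]
hair_strands = [decode_hair(f) for f in hair_feats]

# The two components stay separate until they are combined for rendering,
# so a different hair list could be swapped in without touching the face.
avatar = face_gaussians + hair_strands
```

Because the face and hair outputs are produced by separate branches and only concatenated at the end, replacing `hair_strands` with strands decoded for another identity leaves `face_gaussians` untouched, which is the property that makes hairstyle transfer convenient in a composable representation.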