FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation

Cheng Peng, Zhuo Su, Liao Wang, Chen Guo, Zhaohu Li, Chengjiang Long, Zheng Lv, Jingxiang Sun, Chenyangguang Zhang, Yebin Liu

TL;DR

FlexAvatar addresses the challenge of creating high-fidelity animatable 3D head avatars from 1–4 uncalibrated images without camera poses or expression labels. It combines a transformer-based flexible reconstruction backbone with Structured Head Query tokens and a UV-space UNet decoder to produce expression-aware Gaussian deformations in real time, aided by a distribution-adjusted training strategy and a 10-second refinement for personalization. The approach delivers superior 3D consistency and dynamic realism compared with prior methods and enables practical, real-time avatar animation from sparse inputs. Its design balances scalability from large reconstruction models with the efficiency of Gaussian splatting for interactive digital humans, while outlining directions to extend coverage to accessories, full-body, and relighting. Overall, FlexAvatar advances robust, real-time, pose- and expression-free head avatars suitable for telepresence, VR, and digital humans.

Abstract

We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as a canonical anchor to aggregate a flexible number of inputs, free of camera-pose and expression annotations, into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhance identity-specific details for extreme identities without affecting deformation quality. Extensive experiments demonstrate that FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
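
The abstract does not say how the data distribution adjustment is carried out. As a rough illustration of the general idea only, the sketch below rebalances a toy dataset with inverse-frequency sampling weights, so that rare expression categories are drawn about as often as common ones. The label names, the function names, and the inverse-frequency rule itself are assumptions made for this example, not the authors' procedure.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per sample, equal to 1 / (count of its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def rebalanced_indices(labels, num_draws, seed=0):
    """Draw indices with replacement so every label is equally likely in expectation."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_draws)


# Hypothetical toy labels: 'neutral' dominates, the other two are rare.
labels = ["neutral"] * 90 + ["bared_teeth"] * 5 + ["frown"] * 5
weights = inverse_frequency_weights(labels)
indices = rebalanced_indices(labels, num_draws=3000)
drawn = Counter(labels[i] for i in indices)
print(drawn)
```

Because the weights within each label sum to 1, each of the three categories carries the same total probability mass, even though "neutral" makes up 90% of the raw data. A real training pipeline would more likely pass such weights to its data loader's sampler than draw indices by hand.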

Paper Structure

This paper contains 27 sections, 14 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Given single or sparse input images (e.g., 1–4 images) without camera poses or expression labels, our proposed FlexAvatar produces detailed, real-time 360° reenactment renderings. Our feed-forward model already delivers high-fidelity zero-shot results across diverse identities. An efficient refinement strategy (10 seconds) can be adopted to better match the input images.
  • Figure 2: FlexAvatar reconstructs a high-quality Gaussian head avatar by mapping a flexible number of input images with varying expressions and camera views into Gaussian representations in UV space. We use a flexible feed-forward backbone to obtain static Gaussian attributes and an identity feature map from input images. Given a driving expression signal, we then convert it into a FLAME UV position map and concatenate it with the backbone's identity feature map; to support real-time driving and produce high-quality dynamic results, the concatenated representation is then fed into a UNet to generate expression-dependent dynamic Gaussian attributes in UV space, which are then sampled into FLAME space with LBS for rendering. Optionally, an efficient refinement can further improve the results.
  • Figure 3: Qualitative comparisons with baseline methods. Our single-image feed-forward and finetuned results both outperform other methods in terms of 3D consistency and animation quality, especially on details like wrinkles or teeth. Please zoom in to see the details.
  • Figure 4: Ablation study. Using a position map as the driving signal (a) and performing Data Distribution Adjustment (b) substantially improve the control and realism of dynamic textures, such as oral cavities and wrinkles. Adding a mouth perception loss (c) increases the granular detail of teeth, yielding more complete dental appearance. Finally, applying a finetuning stage (d) further enhances consistency with the input image for widely varying person-specific attributes (e.g., hair, clothing).
  • Figure 5: Evaluation. (a) The model's feed-forward test metrics improve significantly as the number of training identities increases. (b) Increasing the number of training identities also significantly improves convergence speed. (c) A higher proportion of marginal expressions (eye rolling, mouth opening, frowning) is sampled after data distribution adjustment.
  • ...and 11 more figures
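
The Figure 2 caption mentions that the UV-space Gaussian attributes are brought into FLAME space with LBS (linear blend skinning). For readers unfamiliar with the term, here is a minimal generic sketch of LBS on a single 3D point: the point is moved by each joint's rigid transform, and the results are averaged with the skinning weights. The two joints, their transforms, and the weights are toy values chosen for illustration; they are not taken from FLAME or from the paper.

```python
def apply_rigid(transform, point):
    """Apply a 3x4 [R | t] transform, given as three rows, to a 3D point."""
    return [
        sum(row[k] * point[k] for k in range(3)) + row[3]
        for row in transform
    ]


def linear_blend_skinning(point, weights, transforms):
    """Blend the per-joint transformed positions of `point` by the skinning weights."""
    assert abs(sum(weights) - 1.0) < 1e-6, "skinning weights should sum to 1"
    blended = [0.0, 0.0, 0.0]
    for weight, transform in zip(weights, transforms):
        moved = apply_rigid(transform, point)
        for axis in range(3):
            blended[axis] += weight * moved[axis]
    return blended


# Toy joints (assumed): one stays at rest, one rotates 90 degrees about the z axis.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
rot_z_90 = [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]]

skinned = linear_blend_skinning([1.0, 0.0, 0.0], [0.5, 0.5], [identity, rot_z_90])
print(skinned)  # [0.5, 0.5, 0.0]
```

The point (1, 0, 0) stays put under the first joint and moves to (0, 1, 0) under the second, so equal weights place it halfway between the two. In a Gaussian avatar, the same blend would presumably be applied to each Gaussian's center, with orientation handled separately.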