FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Hao Liang, Zhixuan Ge, Soumendu Majee, Ashish Tiwari, G. M. Dilshan Godaliyadda, Ashok Veeraraghavan, Guha Balakrishnan

TL;DR

FastAvatar tackles single-image 3D face reconstruction under unconstrained poses by introducing a template-based 3D Gaussian Splatting representation and a two-stage pipeline: a pose-invariant encoder–decoder predicts Gaussian residuals to deform a FLAME-aligned template, followed by a lightweight appearance refinement. This approach combines the stability of a strong geometric prior with targeted optimization to achieve high fidelity, reporting 24.01 dB PSNR and 0.91 SSIM in roughly 3 seconds on an NVIDIA A100. The method also enables photorealistic novel-view synthesis and FLAME-guided expression animation, with demonstrated generalization to unseen identities and out-of-distribution subjects. Overall, FastAvatar delivers a practical, real-time solution that bridges fast feed-forward prediction and per-subject optimization, expanding the applicability of 3DGS-based facial avatars for interactive applications.
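For readers unfamiliar with the headline metric, the following is a minimal sketch of how PSNR is conventionally computed between a reference image and a rendering. It assumes pixel values in [0, 1]; the function names `psnr` and `rmse_from_psnr` are illustrative, and the paper's exact evaluation protocol (masking, color space, value range) is not stated in this excerpt.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        # Identical images have unbounded PSNR.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula to get the root-mean-square pixel error."""
    return max_value * 10.0 ** (-psnr_db / 20.0)


if __name__ == "__main__":
    ref = np.zeros((4, 4, 3))
    out = np.full((4, 4, 3), 0.1)  # uniform error of 0.1 per pixel
    print(psnr(ref, out))
    print(rmse_from_psnr(24.01))
```

Under the [0, 1] assumption, a PSNR of 24.01 dB corresponds to a root-mean-square pixel error of roughly 0.063, which gives an intuitive scale for the reported number.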

Abstract

We present FastAvatar, a fast and robust algorithm for single-image 3D face reconstruction using 3D Gaussian Splatting (3DGS). Given a single input image from an arbitrary pose, FastAvatar recovers a high-quality, full-head 3DGS avatar in approximately 3 seconds on a single NVIDIA A100 GPU. We use a two-stage design: a feed-forward encoder-decoder predicts coarse face geometry by regressing Gaussian structure from a pose-invariant identity embedding, and a lightweight test-time refinement stage then optimizes the appearance parameters for photorealistic rendering. This hybrid strategy combines the speed and stability of direct prediction with the accuracy of optimization, enabling strong identity preservation even under extreme input poses. FastAvatar achieves state-of-the-art reconstruction quality (24.01 dB PSNR, 0.91 SSIM) while running over 600x faster than existing per-subject optimization methods (e.g., FlashAvatar, GaussianAvatars, GASP). Once reconstructed, our avatars support photorealistic novel-view synthesis and FLAME-guided expression animation, enabling controllable reenactment from a single image. By jointly offering high fidelity, robustness to pose, and rapid reconstruction, FastAvatar significantly broadens the applicability of 3DGS-based facial avatars.
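The refinement stage described above optimizes appearance while leaving the predicted geometry in place. The toy sketch below illustrates that idea only: it stands in for the differentiable 3DGS rasterizer with a fixed linear blend (`weights`), which is an assumption made purely for illustration, and it updates per-Gaussian colors by plain gradient descent on a mean squared error. The names `refine_appearance` and `photometric_loss`, the step count, and the learning rate are all hypothetical and are not taken from the paper.

```python
import numpy as np


def photometric_loss(colors, weights, target):
    """Mean over pixels of the squared RGB error of the blended rendering."""
    residual = weights @ colors - target
    return float(np.mean(np.sum(residual ** 2, axis=1)))


def refine_appearance(colors, weights, target, steps=200, lr=0.5):
    """Gradient descent on colors only; `weights` (the geometry proxy) stays fixed.

    colors:  (num_gaussians, 3) appearance parameters to refine
    weights: (num_pixels, num_gaussians) fixed per-pixel blending weights
    target:  (num_pixels, 3) pixels of the input image
    """
    colors = colors.copy()
    num_pixels = target.shape[0]
    for _ in range(steps):
        residual = weights @ colors - target
        # Gradient of photometric_loss with respect to colors.
        grad = 2.0 * weights.T @ residual / num_pixels
        colors -= lr * grad
    return colors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.random((50, 4))
    weights /= weights.sum(axis=1, keepdims=True)  # rows sum to 1
    target = weights @ rng.random((4, 3))          # synthetic "input image"
    init = np.full((4, 3), 0.5)                    # stand-in for the feed-forward prediction
    refined = refine_appearance(init, weights, target)
    print(photometric_loss(init, weights, target),
          photometric_loss(refined, weights, target))
```

The design point the sketch tries to capture is that restricting the optimization to appearance parameters keeps the problem small and well conditioned, which is consistent with the short refinement time the abstract reports.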

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: FastAvatar produces high-quality 3D face avatars and animations from a single input image. Given an arbitrary-pose face image, FastAvatar reconstructs a complete 3D Gaussian Splatting (3DGS) representation and refines it using a geometry–appearance optimization routine requiring only $\sim$3 seconds on a single NVIDIA A100 GPU. Once reconstructed, the avatar supports photorealistic novel-view synthesis and smooth expression animation driven by FLAME-guided pose and expression controls, while preserving identity and rendering quality across all viewpoints.
  • Figure 2: FastAvatar framework. (a) Template 3DGS face model construction. FastAvatar constructs a template 3DGS face model $\mathcal{T}$ by averaging the parameters of Gaussians across 3DGS models fit on a training set of subjects. (b) Encoder-decoder pipeline. FastAvatar uses an encoder-decoder architecture to map an input image to parameter offsets of the template 3DGS model constructed in (a). We train the decoder to predict parameter offsets for each Gaussian, conditioned on subject-specific and Gaussian-specific embedding vectors. We train the encoder to map multi-pose images of the same identity to the same subject-specific embedding. At inference time, FastAvatar passes an image into the encoder to generate a subject-specific embedding and decodes this embedding to obtain Gaussian-specific parameter offsets, which, combined with the template $\mathcal{T}$, yield a full 3DGS avatar in $\leq 3$ seconds, including refinement.
  • Figure 3: Qualitative comparison on single-image novel-view synthesis. Given a single arbitrary-view input (left), we compare FastAvatar (base and full) with DiffusionRig [ding2023diffusionrig], GAGAvatar [chu2024generalizable], LAM [he2025lam], Arc2Avatar [gerogiannis2025arc2avatar], FlashAvatar [xiang2024flashavatar], and GaussianAvatars (GA) [qian2024gaussianavatars]. GAGAvatar and Arc2Avatar operate in their own canonical spaces; following prior work, we align their outputs to our coordinate frame via PnP (details in Supplementary), though small residual shifts may remain. Diffusion-based and feed-forward baselines struggle under large input poses, often producing blurry textures, synthetic-looking faces, or distorted geometry. GA and FlashAvatar, which require multi-view fitting, degrade noticeably when extended to the single-view setting. In contrast, FastAvatar maintains coherent geometry and identity across wide viewpoint changes; the full model further sharpens appearance through a lightweight 3-second refinement stage. Additional examples, including more poses, expressions, and identity-similarity metrics, are provided in the Supplementary.
  • Figure 4: Reconstruction quality vs. runtime. Ours (base) produces strong feed-forward reconstructions (21.17 dB PSNR, 0.89 SSIM), while Ours (full) achieves state-of-the-art quality (24.01 dB, 0.91 SSIM) with only $\sim$3 seconds of refinement.
  • Figure 5: Self- and cross-reenactment. Starting from a single reconstructed face (left), FastAvatar can reproduce expressions from the same subject (self) or transfer expressions from another subject (cross) by driving the Gaussians with FLAME parameters. Identity remains stable while expressions are well reproduced.
  • ...and 2 more figures
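
The template-plus-offsets design described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each Gaussian is represented by a flat parameter vector, the template is a plain arithmetic mean over subjects (in practice rotation parameters would likely need more care than a naive mean), and the decoder is replaced by a hypothetical linear map. All function names, shapes, and the `weights` matrix are invented for the example.

```python
import numpy as np


def build_template(fitted_models):
    """Average per-Gaussian parameters over subjects.

    fitted_models: (num_subjects, num_gaussians, param_dim)
    returns:       (num_gaussians, param_dim)
    """
    return fitted_models.mean(axis=0)


def decode_offsets(subject_emb, gaussian_embs, weights):
    """Hypothetical linear decoder producing one offset vector per Gaussian.

    subject_emb:   (subject_dim,) identity embedding from the encoder
    gaussian_embs: (num_gaussians, gaussian_dim) per-Gaussian embeddings
    weights:       (subject_dim + gaussian_dim, param_dim)
    """
    num_gaussians = gaussian_embs.shape[0]
    # Pair the single subject embedding with every Gaussian embedding.
    tiled = np.broadcast_to(subject_emb, (num_gaussians, subject_emb.shape[0]))
    features = np.concatenate([tiled, gaussian_embs], axis=1)
    return features @ weights


def assemble_avatar(template, offsets):
    """Subject-specific Gaussians = shared template + predicted offsets."""
    return template + offsets


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fitted = rng.normal(size=(4, 5, 3))      # 4 subjects, 5 Gaussians, 3 params each
    template = build_template(fitted)
    subject_emb = rng.normal(size=8)
    gaussian_embs = rng.normal(size=(5, 6))
    weights = rng.normal(size=(14, 3)) * 0.01
    offsets = decode_offsets(subject_emb, gaussian_embs, weights)
    avatar = assemble_avatar(template, offsets)
    print(avatar.shape)
```

Because every subject shares the same template and Gaussian ordering, the decoder only has to predict deviations from a plausible average face, which is the stability argument the summary makes for the geometric prior.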