FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers

Yue Wu, Xuanhong Chen, Yufan Wu, Wen Li, Yuxi Lu, Kairui Feng

TL;DR

This work proposes FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model.

Abstract

Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT) featuring three key designs: first, a 3DGS transformer that aggregates multi-frame cues while injecting an initial 3D prompt to predict the corresponding registered canonical 3DGS representations; second, multi-granular guidance encoding (camera pose, expression coefficient, head pose) that mitigates animation-induced misalignment for variable-length inputs; third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., quality improves with more observations, and no input data is wasted as in previous works. This yields a quality-speed-tunable paradigm for highly usable 3D avatar modeling. Extensive experiments show that FastAvatar achieves higher quality and highly competitive speed compared with existing methods.

Paper Structure

This paper contains 39 sections, 10 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Unlike existing 3D Avatar methods that can only process fixed-length data, FastAvatar achieves incremental reconstruction. It can strike a good balance between modeling quality and inference speed based on available data volume, delivering high-quality models with sufficient data while providing viable reconstruction results at high speed even with limited data.
  • Figure 2: The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT), which can flexibly process input data with varying expressions, poses, and camera angles, aggregating them into a high-precision 3DGS avatar model. This capability is enabled by several key designs: the interleaving of global attention and frame attention to register complex input data while encoding 3D positional prompts; multi-granular positional information encoding; and the use of landmark tracking loss and sliced fusion loss, allowing the model to smoothly and incrementally fuse additional input data.
  • Figure 3: We benchmark FastAvatar against representative optimization-based methods (MonoGaussian Avatar [monogaussianavatar], GaussianAvatar [DBLP:conf/cvpr/QianKS0GN24]) and feedforward approaches (LAM [LAM]). The results show how each method's performance evolves as the number of input views (i.e., the number of input images) increases. Please zoom in for a better view.
  • Figure 4: Reconstruction quality as the number of input observations increases. More observations improve reconstruction quality.
  • Figure 5: Performance on longer input sequences. Starting from strong reconstructions using only the 16 sparse input frames, incorporating the compressed additional frames further enhances fine-grained details (e.g., the oral cavity, which is absent in most frames). Uniform sampling fails to achieve this improvement, and feeding all frames leads to out-of-memory (OOM) errors.
  • ...and 11 more figures
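
The interleaving of frame attention and global attention mentioned in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `frame_attention`, `global_attention`, `interleaved_attention`) are hypothetical, the attention uses identity projections with no learned weights, and residual connections, the 3D prompt, and the guidance encodings are all omitted. It only shows how alternating the two attention scopes lets one model accept a variable number of frames, each with its own token count.

```python
import math


def attend(tokens):
    """Toy single-head self-attention with identity Q/K/V projections (assumption)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the (unprojected) token values.
        out.append([sum(w / total * value[j] for w, value in zip(weights, tokens))
                    for j in range(dim)])
    return out


def frame_attention(frames):
    """Each frame's tokens attend only to tokens of the same frame."""
    return [attend(frame) for frame in frames]


def global_attention(frames):
    """Tokens of all frames attend jointly, then are split back per frame."""
    flat = [token for frame in frames for token in frame]
    mixed = attend(flat)
    out, start = [], 0
    for frame in frames:
        out.append(mixed[start:start + len(frame)])
        start += len(frame)
    return out


def interleaved_attention(frames, num_blocks=2):
    """Alternate frame-level and global attention for a number of blocks."""
    for _ in range(num_blocks):
        frames = frame_attention(frames)
        frames = global_attention(frames)
    return frames


if __name__ == "__main__":
    # Two frames with different token counts, each token of dimension 2.
    frames = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
    ]
    fused = interleaved_attention(frames)
    print([len(frame) for frame in fused])  # prints [2, 3]
```

Because global attention flattens whatever frames it is given, the same function works for a single image or for many frames; with a single frame it reduces to frame attention.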