InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars

Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Jinli Suo, Yebin Liu

TL;DR

The paper addresses the challenge of producing high-fidelity digital head avatars efficiently from monocular images. It proposes Incremental 3D GAN Inversion, combining a compact animatable 3D GAN prior, a UV-aligned neural texture encoder, and ConvGRU-based temporal fusion to progressively leverage multiple frames. Key contributions include a two-stage one-shot inversion into $W+$ latent and canonical texture/tri-plane spaces, a UV-aligned texture encoding strategy, and recurrent temporal aggregation that generalizes to varying frame counts, yielding state-of-the-art results in one-shot and few-shot avatar animation. The approach enables rapid, photorealistic avatar reconstruction suitable for interactive AR/VR and telepresence, while also acknowledging limitations tied to the expressive range of the parametric model and potential societal risks such as deepfakes.

Abstract

While high fidelity and efficiency are central to the creation of digital head avatars, recent methods relying on 2D or 3D generative models often experience limitations such as shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, Incremental 3D GAN Inversion, that enhances avatar reconstruction performance using an algorithm designed to increase the fidelity from multiple frames, resulting in improved reconstruction quality proportional to frame count. Our method introduces a unique animatable 3D GAN prior with two crucial modifications for enhanced expression controllability alongside an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization. Differentiating from traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, mitigating the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks for temporal data aggregation from multiple frames, boosting geometry and texture detail reconstruction. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks. Code will be available at https://github.com/XChenZ/invertAvatar.
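
The recurrent aggregation mentioned in the abstract can be illustrated with a generic ConvGRU cell. The following is a minimal NumPy sketch, not the authors' implementation: the 3x3 kernel size, the channel counts, the random weights, and the names `conv3x3`, `ConvGRUCell`, and `aggregate` are all assumptions made for illustration. It only shows how a hidden feature map can be refined one frame at a time, so that the same cell handles any number of input frames.

```python
import numpy as np

def conv3x3(x, w):
    """Same-padded 3x3 cross-correlation. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Accumulate the contribution of each kernel tap over a shifted window.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class ConvGRUCell:
    """Generic ConvGRU cell with random (untrained) weights; biases omitted."""

    def __init__(self, in_ch, hid_ch, rng):
        shape = (hid_ch, in_ch + hid_ch, 3, 3)
        self.w_z = rng.normal(0.0, 0.1, shape)  # update gate
        self.w_r = rng.normal(0.0, 0.1, shape)  # reset gate
        self.w_h = rng.normal(0.0, 0.1, shape)  # candidate state
        self.hid_ch = hid_ch

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=0)
        z = sigmoid(conv3x3(xh, self.w_z))
        r = sigmoid(conv3x3(xh, self.w_r))
        cand = np.tanh(conv3x3(np.concatenate([x, r * h], axis=0), self.w_h))
        # Blend the previous state with the candidate, per pixel and channel.
        return (1.0 - z) * h + z * cand

def aggregate(cell, frames):
    """Fold a sequence of per-frame feature maps into one hidden state."""
    h = np.zeros((cell.hid_ch,) + frames[0].shape[1:])
    for x in frames:
        h = cell.step(x, h)
    return h

rng = np.random.default_rng(0)
cell = ConvGRUCell(in_ch=4, hid_ch=8, rng=rng)
frames = [rng.normal(size=(4, 6, 5)) for _ in range(5)]
state = aggregate(cell, frames)
print(state.shape)
```

Because the state is updated frame by frame, processing one more frame only requires a single extra `step` call on the existing state rather than recomputing over the whole sequence, which is the property that distinguishes a recurrent scheme from a fixed-window fusion.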

Paper Structure (39 sections, 3 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: An architecture overview of Next3D, along with a visualization of the modifications in our adopted 3D generative model.
  • Figure 2: The left part illustrates our two-stage avatar reconstruction pipeline. The coarse stage "Latent Space Inversion" (Sec. 3.3) inverts the first frame in the GAN prior's $W+$ latent space with $E_{latent}$, forming an initial avatar. The fine stage performs offset prediction in canonical feature spaces (Sec. 3.3), and we specifically design recurrent networks $E_{tex\_rec}$ and $E_{tri\_rec}$ to aggregate temporal information, incrementally refining a high-fidelity avatar (Sec. 3.4). The architecture of our advanced animatable 3D generative model is depicted in the right box.
  • Figure 3: Comparison of our animatable 3D generative model with Next3D (Sun et al., 2023). We extract frames from a driving video clip and use the estimated facial model parameters to animate randomly sampled virtual avatars. Please zoom in and refer to our video for clearer comparisons.
  • Figure 4: With a continuous input video stream, our method incrementally refines facial shape and texture details, in contrast to the fixed-window baseline "ConvFusion_avg", which tends to produce blurry outcomes.
  • Figure 5: LPIPS versus the number of input frames on VFHQ-test. With longer sequences, the metrics of our method improve and converge, affirming its proficiency in long-term temporal aggregation, whereas fixed-window baselines consistently degrade as the number of source images increases.
  • ...and 7 more figures