InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars
Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Jinli Suo, Yebin Liu
TL;DR
The paper addresses the challenge of producing high-fidelity digital head avatars efficiently from monocular images. It proposes Incremental 3D GAN Inversion, combining a compact animatable 3D GAN prior, a UV-aligned neural texture encoder, and ConvGRU-based temporal fusion to progressively leverage multiple frames. Key contributions include a two-stage one-shot inversion into $W+$ latent and canonical texture/tri-plane spaces, a UV-aligned texture encoding strategy, and recurrent temporal aggregation that generalizes to varying frame counts, yielding state-of-the-art results in one-shot and few-shot avatar animation. The approach enables rapid, photorealistic avatar reconstruction suitable for interactive AR/VR and telepresence. The paper also acknowledges limitations tied to the expressive range of the parametric model and potential societal risks such as deepfakes.
Abstract
High fidelity and efficiency are central to the creation of digital head avatars, yet recent methods relying on 2D or 3D generative models often suffer from shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, Incremental 3D GAN Inversion, whose reconstruction fidelity improves as more input frames become available. Our method introduces an animatable 3D GAN prior with two crucial modifications for enhanced expression controllability, alongside a neural texture encoder that organizes the texture feature space according to UV parameterization. Unlike traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, which mitigates the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks to aggregate temporal data from multiple frames, improving the reconstruction of geometry and texture detail. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks. Code will be available at https://github.com/XChenZ/invertAvatar.
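To make the recurrent aggregation idea concrete, the following is a minimal NumPy sketch of a generic convolutional GRU cell folding a variable number of per-frame feature maps into one fused state. It is not the authors' implementation: the class `ConvGRUCell`, the helper `fuse_frames`, the 3x3 kernel, the random weights, and the channel sizes are all assumptions chosen for illustration, and the sketch uses the standard GRU update with convolutions in place of matrix products.

```python
import numpy as np


def conv3x3(x, w):
    """Zero-padded 3x3 convolution. x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], height, width))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class ConvGRUCell:
    """Hypothetical ConvGRU cell; weights are random stand-ins for trained ones."""

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hid_ch, in_ch + hid_ch, 3, 3)
        self.w_z = rng.normal(0.0, 0.1, shape)  # update gate
        self.w_r = rng.normal(0.0, 0.1, shape)  # reset gate
        self.w_h = rng.normal(0.0, 0.1, shape)  # candidate state
        self.hid_ch = hid_ch

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=0)
        z = sigmoid(conv3x3(xh, self.w_z))
        r = sigmoid(conv3x3(xh, self.w_r))
        cand = np.tanh(conv3x3(np.concatenate([x, r * h], axis=0), self.w_h))
        # Per-pixel blend of the previous state and the new candidate.
        return (1.0 - z) * h + z * cand


def fuse_frames(cell, frames):
    """Fold any number of (C_in, H, W) feature maps into one hidden state."""
    h = np.zeros((cell.hid_ch,) + frames[0].shape[1:])
    for x in frames:
        h = cell.step(x, h)
    return h


cell = ConvGRUCell(in_ch=4, hid_ch=8)
rng = np.random.default_rng(1)
frames = [rng.normal(size=(4, 16, 16)) for _ in range(5)]
fused_one = fuse_frames(cell, frames[:1])   # one-shot case
fused_five = fuse_frames(cell, frames)      # few-shot case
```

Because the hidden state has a fixed shape, the same cell handles one frame or many, which is the property that lets a recurrent design generalize across frame counts. The update gate `z` decides per pixel how much of each new frame overwrites the accumulated state.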
