From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors
Ding-Jiun Huang, Yuanhao Wang, Shao-Ji Yuan, Albert Mosella-Montoro, Francisco Vicente Carrasco, Cheng Zhang, Fernando De la Torre
TL;DR
SuperHead enhances low-quality, animatable 3D head avatars by leveraging strong 3D generative priors through a dynamics-aware inversion framework. It uses multi-view 3D GAN inversion to produce a high-fidelity static Gaussian head, rigidly binds that head to a FLAME mesh for animation, and then applies dynamics-aware refinement using multi-expression anchor renderings to preserve identity and detail under motion. Key contributions include a novel pipeline that integrates 3D GAN priors with explicit depth supervision and mesh refinement, producing temporally coherent, photorealistic avatars under diverse expressions and viewpoints. The approach achieves state-of-the-art visual quality on the NeRSemble and INSTA benchmarks and generalizes across different Gaussian-based head models, with practical implications for VR, telepresence, and digital entertainment. Acknowledged limitations are back-head synthesis and full 360-degree coverage.
Abstract
Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet it is often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image-, video-, and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors of pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
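As a rough illustration of the jointly supervised inversion described above, the following toy sketch optimizes a latent vector against image-like and depth-like targets from several anchors (expression and viewpoint pairs). It is not the SuperHead implementation: the linear "renderers", the names `make_anchor`, `joint_loss`, `joint_grad`, and `invert`, the depth weight of 0.5, and plain gradient descent are all assumptions made for illustration, standing in for the pre-trained 3D generative model, the 3DGS rasterizer, and the actual losses.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def make_anchor(rng, latent_dim, n_rgb, n_depth):
    """Hypothetical linear 'renderer' for one (expression, viewpoint) anchor.
    Returns one matrix for image-like outputs and one for depth-like outputs."""
    rgb = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_rgb)]
    depth = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_depth)]
    return rgb, depth

def joint_loss(w, anchors, targets, depth_weight):
    """Sum over anchors of squared image error plus weighted squared depth error."""
    total = 0.0
    for (rgb_m, depth_m), (rgb_t, depth_t) in zip(anchors, targets):
        total += sum((p - t) ** 2 for p, t in zip(matvec(rgb_m, w), rgb_t))
        total += depth_weight * sum(
            (p - t) ** 2 for p, t in zip(matvec(depth_m, w), depth_t))
    return total

def joint_grad(w, anchors, targets, depth_weight):
    """Analytic gradient of joint_loss with respect to the latent w."""
    grad = [0.0] * len(w)
    for (rgb_m, depth_m), (rgb_t, depth_t) in zip(anchors, targets):
        for mat, tgt, weight in ((rgb_m, rgb_t, 1.0),
                                 (depth_m, depth_t, depth_weight)):
            residual = [p - t for p, t in zip(matvec(mat, w), tgt)]
            for row, r in zip(mat, residual):
                for j, a in enumerate(row):
                    grad[j] += 2.0 * weight * r * a
    return grad

def invert(anchors, targets, latent_dim, depth_weight=0.5, steps=2000):
    """Gradient descent on the latent, starting from zero.
    The step size is 1 / trace(Hessian), which is a safe bound for this
    quadratic toy loss (an assumption specific to the linear stand-in)."""
    trace = 0.0
    for rgb_m, depth_m in anchors:
        trace += 2.0 * sum(a * a for row in rgb_m for a in row)
        trace += 2.0 * depth_weight * sum(a * a for row in depth_m for a in row)
    lr = 1.0 / trace
    w = [0.0] * latent_dim
    for _ in range(steps):
        g = joint_grad(w, anchors, targets, depth_weight)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

rng = random.Random(0)
latent_dim = 3
# Four anchors stand in for the sparse set of expressions and viewpoints.
anchors = [make_anchor(rng, latent_dim, n_rgb=4, n_depth=2) for _ in range(4)]
w_true = [0.7, -1.2, 0.4]  # latent that generated the (noise-free) targets
targets = [(matvec(r, w_true), matvec(d, w_true)) for r, d in anchors]

w_hat = invert(anchors, targets, latent_dim)
loss_before = joint_loss([0.0] * latent_dim, anchors, targets, 0.5)
loss_after = joint_loss(w_hat, anchors, targets, 0.5)
print("loss before:", loss_before)
print("loss after:", loss_after)
```

The point of the sketch is only the structure of the objective: a single latent is shared across all anchors, so errors observed under any expression or viewpoint pull on the same representation. In the real setting the renderer is nonlinear and the targets are upscaled renderings and depth maps rather than synthetic vectors.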
