From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors

Ding-Jiun Huang, Yuanhao Wang, Shao-Ji Yuan, Albert Mosella-Montoro, Francisco Vicente Carrasco, Cheng Zhang, Fernando De la Torre

TL;DR

SuperHead addresses the challenge of enriching low-quality, animatable 3D head avatars by leveraging strong 3D generative priors through a dynamics-aware inversion framework. It combines multi-view 3D GAN inversion to produce a high-fidelity static Gaussian head, rigidly binds it to a FLAME mesh for animation, and then applies dynamics-aware refinement using multi-expression anchor renderings to sustain identity and detail under motion. Key contributions include a novel pipeline that integrates 3D GAN priors with explicit depth supervision and mesh refinement, producing temporally coherent, photorealistic avatars under diverse expressions and viewpoints. The approach achieves state-of-the-art visual quality on NeRSemble and INSTA benchmarks and demonstrates generalization across different Gaussian-based head models, with practical implications for VR, telepresence, and digital entertainment, while acknowledging limitations in back-head synthesis and full 360-degree coverage.

Abstract

Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
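The inversion idea in the abstract (optimize a latent code so that renderings of the generated head match image and depth targets from several views) can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions, not the authors' implementation: the "generator" is a fixed linear map `A`, a "view" just selects entries of the generated vector, and `generate`, `render`, `loss_and_grad`, `invert`, and `depth_weight` are hypothetical names chosen for this example.

```python
# Toy latent inversion with joint colour + depth supervision (illustrative only).

# Hypothetical stand-in for a pre-trained generator: a fixed linear map
# from a 2-D latent code to a 4-D "head" vector.
A = [[1.0, 0.5], [-0.3, 1.2], [0.8, -0.7], [0.2, 0.9]]

def generate(z):
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]

# Each toy "view" observes some entries as colour and some as depth.
VIEWS = [((0, 1), (3,)), ((1, 2), (3,)), ((0, 2), (3,))]

def render(head, view):
    color_idx, depth_idx = view
    return [head[i] for i in color_idx], [head[i] for i in depth_idx]

def loss_and_grad(z, targets, depth_weight=0.5):
    """Squared colour error plus weighted squared depth error, summed over views."""
    head = generate(z)
    loss = 0.0
    grad_head = [0.0] * len(head)
    for (color_idx, depth_idx), (t_color, t_depth) in zip(VIEWS, targets):
        for i, t in zip(color_idx, t_color):
            r = head[i] - t
            loss += r * r
            grad_head[i] += 2.0 * r
        for i, t in zip(depth_idx, t_depth):
            r = head[i] - t
            loss += depth_weight * r * r
            grad_head[i] += 2.0 * depth_weight * r
    # Chain rule through the linear generator.
    grad_z = [sum(grad_head[i] * A[i][j] for i in range(len(A)))
              for j in range(len(z))]
    return loss, grad_z

def invert(targets, steps=500, lr=0.05):
    """Plain gradient descent on the latent code, starting from zero."""
    z = [0.0, 0.0]
    for _ in range(steps):
        _, g = loss_and_grad(z, targets)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z

# Targets are rendered from a known latent so recovery can be checked.
z_true = [0.7, -0.4]
targets = [render(generate(z_true), v) for v in VIEWS]
z_hat = invert(targets)
final_loss, _ = loss_and_grad(z_hat, targets)
```

In this toy the targets are consistent, so the recovered latent matches the one used to create them. A real 3D GAN is non-linear and would typically be inverted with an automatic-differentiation framework and additional perceptual or identity terms, none of which are modelled here.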

Paper Structure

This paper contains 18 sections, 12 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: We present SuperHead, a new method for super-resolving low-resolution 3D head avatars. Given a low-resolution animatable avatar reconstructed from low-quality captures, SuperHead synthesizes high-fidelity geometry and detailed textures while ensuring multi-view and temporal consistency under diverse facial expressions. Unlike prior image- or video-based SR approaches, our method directly upsamples 3D avatars, enabling photorealistic animation and faithful identity preservation from degraded inputs.
  • Figure 2: (a) A typical 3DGS process optimizes a head from multi-view images. (b) 3D GAN inversion aims to find a latent code whose generated 3D head best explains the given multi-view images. (c) When the input views are inconsistent, typical 3DGS tends to produce an "averaged" result as a compromise among them. 3D GAN inversion can generate high-frequency details because it searches in a pre-trained high-resolution space.
  • Figure 3: Overview of SuperHead. Given a low-resolution 3D head avatar driven by a morphable model, we first reconstruct a static 3D head in the canonical space with multi-view 3D GAN inversion (Section \ref{multi_view_3d_inversion}). We then refine the mesh geometry and rig the 3D Gaussians onto the mesh surface to enable animation (Section \ref{3d_gaussian_rigging}). We further include anchor images with diverse camera poses and expressions for dynamics-aware 3D refinement, ensuring the robustness of the 3D head model across viewing angles and complex facial motions (Section \ref{dynamics_aware_inversion}).
  • Figure 4: Example of sampled anchor images and the corresponding FLAME mesh used for 3D GAN inversion.
  • Figure 5: Qualitative comparisons on the NeRSemble dataset \cite{kirschstein2023nersemble}. SuperHead synthesizes high-quality facial details across diverse expressions, clearly outperforming baselines and in some cases approaching the pseudo ground-truth head avatar.
  • ...and 7 more figures
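
The rigging step named in the Figure 3 caption (attaching 3D Gaussians to the mesh surface so they follow the animated mesh) can be sketched as follows. This is a minimal illustration under assumptions and is not taken from the paper: each Gaussian centre is stored in the local frame of one parent triangle and re-posed rigidly when that triangle moves. The helpers `triangle_frame`, `bind`, and `pose` are hypothetical names, and scale, rotation of the Gaussian covariance, and the choice of parent triangle are all left out.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

def triangle_frame(v0, v1, v2):
    """Return the centroid and an orthonormal frame (edge, in-plane, normal)."""
    origin = [(a + b + c) / 3.0 for a, b, c in zip(v0, v1, v2)]
    e0 = normalize(sub(v1, v0))
    n = normalize(cross(sub(v1, v0), sub(v2, v0)))
    e1 = cross(n, e0)  # unit length because n and e0 are orthonormal
    return origin, [e0, e1, n]

def bind(point, tri):
    """Express a world-space point in the local frame of its parent triangle."""
    origin, axes = triangle_frame(*tri)
    d = sub(point, origin)
    return [dot(d, ax) for ax in axes]

def pose(local, tri):
    """Map local coordinates back to world space for a (possibly moved) triangle."""
    origin, axes = triangle_frame(*tri)
    return [origin[k] + sum(local[i] * axes[i][k] for i in range(3))
            for k in range(3)]

# Canonical triangle and a Gaussian centre hovering 0.1 above it.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
center = [0.3, 0.3, 0.1]
local = bind(center, canonical)

# The same triangle after a 90-degree rotation about the x-axis.
rotated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
moved = pose(local, rotated)
```

Because the local coordinates are fixed at binding time, the Gaussian keeps its offset from the surface as the triangle moves: posing with the canonical triangle returns the original centre, and posing with the rotated triangle returns the centre rotated by the same 90 degrees.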