High-Fidelity Facial Albedo Estimation via Texture Quantization

Zimin Ran; Xingyu Ren; Xiang An; Kaicheng Yang; Xiangzi Dai; Ziyong Feng; Jia Guo; Linchao Zhu; Jiankang Deng

TL;DR

HiFiAlbedo tackles high-fidelity facial albedo reconstruction from monocular images without relying on captured albedo data. It builds a high-quality facial texture codebook from large-scale, high-resolution RGB faces, reconstructs UV textures under dual-discriminator adversarial supervision, and then uses a cross-attention mechanism with a group identity loss to map textures to unbiased albedo latent representations. The approach supports multi-image inference to mitigate illumination–albedo ambiguity and demonstrates competitive performance on the FAIR benchmark with strong generalization to real-world imagery. This self-supervised pipeline reduces data requirements while enabling realistic rendering and robust albedo recovery across diverse identities.

Abstract

Recent 3D face reconstruction methods have made significant progress in shape estimation, but high-fidelity facial albedo reconstruction remains challenging. Existing methods depend on expensive light-stage captured data to learn facial albedo maps. However, a lack of diversity in subjects limits their ability to recover high-fidelity results. In this paper, we present a novel facial albedo reconstruction model, HiFiAlbedo, which recovers the albedo map directly from a single image without the need for captured albedo data. Our key insight is that the albedo map is the illumination invariant texture map, which enables us to use inexpensive texture data to derive an albedo estimation by eliminating illumination. To achieve this, we first collect large-scale ultra-high-resolution facial images and train a high-fidelity facial texture codebook. By using the FFHQ dataset and limited UV textures, we then fine-tune the encoder for texture reconstruction from the input image with adversarial supervision in both image and UV space. Finally, we train a cross-attention module and utilize group identity loss to learn the adaptation from facial texture to the albedo domain. Extensive experimentation has demonstrated that our method exhibits excellent generalizability and is capable of achieving high-fidelity results for in-the-wild facial albedo recovery. Our code, pre-trained weights, and training data will be made publicly available at https://hifialbedo.github.io/.
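
The texture codebook mentioned above can be illustrated with the nearest-neighbour lookup that VQ-style auto-encoders commonly use. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `quantize`, the toy two-dimensional codebook, and the squared-L2 distance are assumptions made for illustration, and a real model would operate on learned high-dimensional feature maps.

```python
def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    features: list of vectors (lists of floats), e.g. encoder outputs.
    codebook: list of code vectors with the same dimensionality.
    Returns (indices, quantized), where indices[i] is the chosen code id.
    Assumption: nearest means smallest squared L2 distance.
    """
    indices = []
    quantized = []
    for f in features:
        # Squared L2 distance from this feature to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        # Index of the closest code vector.
        k = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized
```

In this sketch a decoder would see only the quantized vectors, so every reconstruction is composed from entries of the learned codebook.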

Paper Structure (17 sections, 10 equations, 20 figures, 4 tables)

Introduction
Related Work
Methodology
Facial Texture Codebook Learning
UV Texture Reconstruction
High-fidelity Albedo Disambiguation
Experiments
Implementation Details
FAIR Benchmark Results
Real-World Results
Ablation Studies
Conclusion
Acknowledgements
Datasets
More Ablation Studies
...and 2 more sections

Figures (20)

Figure 1: Visualization of the reconstruction results based on the codebook. Given facial images, textures, or albedos as inputs, our pre-trained VQGAN consistently achieves high-fidelity reconstruction results.
Figure 2: Overview of our UV texture reconstruction pipeline. After training a VQ-based auto-encoder (blue box), we fine-tune the encoder and propose a dual discriminator (pink box). UV texture reconstruction from a single image is achieved by adversarial supervision in both latent and image space.
Figure 3: Overview of the proposed method HiFiAlbedo. Our core insight is that an individual shares the same albedo in different scenes. Therefore, we propose a group identity loss for unsupervised domain adaptation from texture to albedo. Specifically, we first sample the faces of the same person with similar attributes as input and then extract features by using the encoder trained for texture reconstruction. These features are projected separately, and then cross-attention is computed using the learnable query latent to obtain the shared albedo. Finally, the albedo is overlaid back onto the original image, and then the group identity loss is computed to obtain a high-fidelity face albedo.
Figure 4: Comparison on the FAIR benchmark [feng2022towards]. From left to right: input image, GANFIT [Gecer19:ganfit], INORig [Bai21], MGCNet [shang2020self], Deep3D [deng2019accurate], CEST [Wen21], DECA [Feng2021], TRUST [feng2022towards], ID2Albedo [ren2023improving], ours, and ground-truth albedo rendering.
Figure 5: Comparisons on in-the-wild images. From top to bottom: inputs, ours, ID2Albedo [ren2023improving], and TRUST [feng2022towards] albedo and rendered images. We achieve the most realistic rendered results.
...and 15 more figures
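
The Figure 3 caption describes pooling features from several images of one person through cross-attention with a learnable query. A minimal sketch of that idea, in plain Python, is given below. It assumes single-head scaled dot-product attention with one query vector; the names `shared_latent`, `W_k`, and `W_v` are hypothetical, and the paper's actual projections, dimensions, and number of heads are not stated in this summary.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def shared_latent(query, feats, W_k, W_v):
    """Pool per-image features into one shared vector.

    query: learnable query vector (assumed to be a single vector here).
    feats: one feature vector per input image of the same person.
    W_k, W_v: key and value projection matrices (hypothetical names).
    Returns (pooled, weights), where weights sum to 1 across images.
    """
    keys = [matvec(W_k, f) for f in feats]
    values = [matvec(W_v, f) for f in feats]
    scale = math.sqrt(len(query))
    # One attention score per image: scaled dot product with the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Attention-weighted sum of the value vectors.
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights
```

Because the attention weights are normalised over the images, the same function accepts any number of inputs, which is one way multi-image inference could be supported.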