Towards Real-World Blind Face Restoration with Generative Diffusion Prior

Xiaoxu Chen, Jingfan Tan, Tao Wang, Kaihao Zhang, Wenhan Luo, Xiaochun Cao

TL;DR

This work addresses blind face restoration by leveraging a pretrained Stable Diffusion prior. It introduces BFRffusion, a four-module restoration architecture, and PFHQ, a privacy-preserving face dataset with balanced attributes. The approach combines shallow degradation removal, multi-scale transformer-based feature extraction, trainable time-aware prompts, and a finetuned denoising U-Net to restore faces in latent space, with the denoising of the latent $z_t$ guided at every time step. Experiments show state-of-the-art performance on synthetic and real-world datasets; ablations validate each component’s contribution, and PFHQ proves competitive as training data while offering privacy advantages. In practice, the work offers a privacy-conscious path to high-quality face restoration and an accessible resource for training on balanced, synthetic face data.

Abstract

Blind face restoration is an important task in computer vision and has gained significant attention due to its wide range of applications. Previous works mainly exploit facial priors to restore face images and have demonstrated high-quality results. However, generating faithful facial details remains a challenging problem due to the limited prior knowledge obtained from finite data. In this work, we delve into the potential of leveraging the pretrained Stable Diffusion for blind face restoration. We propose BFRffusion, which is designed to effectively extract features from low-quality face images and to restore realistic and faithful facial details with the generative prior of the pretrained Stable Diffusion. In addition, we build a privacy-preserving face dataset called PFHQ with balanced attributes such as race, gender, and age. This dataset can serve as a viable alternative for training blind face restoration networks, effectively addressing the privacy and bias concerns usually associated with real face datasets. Through an extensive series of experiments, we demonstrate that BFRffusion achieves state-of-the-art performance on both synthetic and real-world public testing datasets for blind face restoration, and that PFHQ is a usable resource for training blind face restoration networks. The code, pretrained models, and dataset are released at https://github.com/chenxx89/BFRffusion.

Paper Structure

This paper contains 33 sections, 15 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Representative face images from the proposed PFHQ dataset. These face images exhibit balanced race, gender, and age distribution.
  • Figure 2: Overview of the architecture of BFRffusion which consists of four modules. The shallow degradation removal module (SDRM) and the multi-scale feature extraction module (MFEM) remove shallow degradation and extract multi-scale features from low-quality face images. The pretrained denoising U-Net module (PDUM) utilizes multi-scale features and prompts from the trainable time-aware prompt module (TTPM) as conditions to predict the next step of noise based on the input noise. After multiple denoising steps, high-quality latent features are obtained, which are subsequently transformed into high-quality face images by the pretrained decoder. The MFEM is composed of several transformer blocks, whose structure is illustrated below the dashed line.
  • Figure 3: Visualization of feature maps learned by our multi-scale feature extraction module (MFEM) at different timesteps and resolutions. The first row demonstrates the capability of our MFEM to extract accurate features at any timestep. The second row shows the multi-scale features extracted by our MFEM at various resolutions.
  • Figure 4: The pipeline of our face image generation process. We choose aligned face parsing maps as the input of the pipeline.
  • Figure 5: Visual results of modification to the face parsing maps. The first row shows examples of the face parsing maps and the second row shows corresponding image generation results. The modifications are as follows: (a) base, (b) adding earrings, (c) changing the hairstyle, (d) adding glasses, (e) adding a hat, (f) changing mouth style. Zoom in for best view.
  • ...and 6 more figures
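
The dataflow described in the Figure 2 caption can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not the authors' implementation: every function below (`sdrm`, `mfem`, `ttpm`, `pdum`, `decode`, `restore`) is a hypothetical stand-in named after the corresponding module, images and latents are plain 1-D lists of floats, and the update rule inside the loop is a simple toy step rather than a real diffusion sampler. Whether SDRM runs once or at every step is also an assumption here. What the sketch does show is the order of operations: shallow degradation removal, multi-scale features and a time-aware prompt computed per step, noise prediction conditioned on both, and a final decode.

```python
import random
from typing import List

Vec = List[float]


def avg_pool(x: Vec, k: int) -> Vec:
    """Average consecutive groups of k values (the last group may be shorter)."""
    return [sum(x[i:i + k]) / len(x[i:i + k]) for i in range(0, len(x), k)]


def sdrm(lq: Vec) -> Vec:
    """Toy stand-in for SDRM: a 3-tap moving average with clamped edges."""
    n = len(lq)
    return [(lq[max(i - 1, 0)] + lq[i] + lq[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def mfem(shallow: Vec, t: int) -> List[Vec]:
    """Toy stand-in for MFEM: features at three scales (pool by 1, 2, 4).

    The module in the paper is timestep-aware; this toy accepts t but ignores it.
    """
    return [avg_pool(shallow, k) for k in (1, 2, 4)]


def ttpm(t: int, num_steps: int) -> Vec:
    """Toy stand-in for TTPM: a prompt vector that depends only on t."""
    frac = t / num_steps
    return [frac, 1.0 - frac]


def pdum(z_t: Vec, t: int, feats: List[Vec], prompt: Vec) -> Vec:
    """Toy stand-in for PDUM: treats the finest-scale feature as the clean
    target and returns z_t minus that target as the "noise" estimate.
    The prompt and the coarser scales are accepted but unused in this toy.
    """
    target = feats[0]
    return [z - c for z, c in zip(z_t, target)]


def decode(z: Vec) -> Vec:
    """Toy stand-in for the pretrained decoder: identity."""
    return list(z)


def restore(lq: Vec, num_steps: int = 10, seed: int = 0) -> Vec:
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in lq]  # start from Gaussian noise
    shallow = sdrm(lq)                      # assumption: computed once
    for t in range(num_steps, 0, -1):       # t = num_steps, ..., 1
        feats = mfem(shallow, t)            # multi-scale conditions
        prompt = ttpm(t, num_steps)         # time-aware prompt
        eps = pdum(z, t, feats, prompt)     # conditioned noise estimate
        # Toy update (not DDPM/DDIM): remove a 1/t fraction of the estimate,
        # so the last step (t = 1) lands on the target.
        z = [zi - ei / t for zi, ei in zip(z, eps)]
    return decode(z)
```

Calling `restore(lq)` on a list of floats returns a list of the same length. Because the toy predictor uses the finest-scale feature as its target, the result converges to `sdrm(lq)`; a real system would replace each stand-in with a learned network and the toy update with an actual sampler.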