Towards Unsupervised Blind Face Restoration using Diffusion Prior

Tianshu Kuai, Sina Honari, Igor Gilitschenski, Alex Levinshtein

TL;DR

This paper uses a pre-trained diffusion model as a generative prior to generate high-quality images from the natural image distribution while preserving the input image content through consistency constraints, and achieves state-of-the-art results on both synthetic and real-world datasets.

Abstract

Blind face restoration methods have shown remarkable performance, particularly when trained on large-scale synthetic datasets with supervised learning. These datasets are often generated by simulating low-quality face images with a handcrafted image degradation pipeline. Models trained on such synthetic degradations, however, cannot handle inputs with unseen degradations. In this paper, we address this issue by using only a set of input images, with unknown degradations and without ground-truth targets, to fine-tune a restoration model that learns to map them to clean and contextually consistent outputs. We use a pre-trained diffusion model as a generative prior to generate high-quality images from the natural image distribution while preserving the input image content through consistency constraints. These generated images are then used as pseudo targets to fine-tune a pre-trained restoration model. Unlike many recent approaches that employ diffusion models at test time, we use the diffusion model only during training and thus keep inference efficient. Extensive experiments show that the proposed approach consistently improves the perceptual quality of pre-trained blind face restoration models while maintaining strong consistency with the input content. Our best model also achieves state-of-the-art results on both synthetic and real-world datasets.

Paper Structure (40 sections, 13 equations, 23 figures, 27 tables, 8 algorithms)

This paper contains 40 sections, 13 equations, 23 figures, 27 tables, 8 algorithms.

Figures (23)

  • Figure 1: Overview. A restoration model pre-trained on synthetic datasets in a supervised fashion produces high-quality restorations for low-quality images whose degradations match the distribution used in training (a). However, it often fails on inputs with out-of-distribution degradations (b). We propose an unsupervised pipeline that adapts a pre-trained model to unpaired degraded images of the target degradation using a much smaller dataset. This addresses the domain gap in degradation types without paired ground-truth images or knowledge of the target data's degradation type (c). Zoom in for details.
  • Figure 2: Overview of our unsupervised fine-tuning pipeline. Given a pre-trained restoration model that produces low-quality restoration outputs (severe artifacts on hair and over-smoothed skin) on samples with unknown, out-of-distribution degradations, we generate pseudo targets using a pre-trained unconditional diffusion model that combines low-frequency content-constrained denoising with unconditional denoising. The generated clean images serve as pseudo ground truth (GT) for fine-tuning the pre-trained restoration model, without the need for real GT images.
  • Figure 3: Visualization of low-frequency content at different timesteps. We show the low-frequency content of the low-quality restoration from a pre-trained SwinIR [liang2021swinir] and of its GT counterpart at different timesteps of the forward diffusion process (zoom in for details).
  • Figure 4: Qualitative comparison of pre-trained versus fine-tuned models. We show the effectiveness of our proposed approach on pre-trained SwinIR [liang2021swinir] and CodeFormer [codeformer] models, using 4$\times$ and 8$\times$ downsampled data at a moderate noise level. The fine-tuned models produce realistic restorations (zoom in for details).
  • Figure 5: Qualitative comparison with SOTA baselines on synthetic datasets. Our fine-tuned CodeFormer model outperforms all other baselines and its pre-trained counterpart on severely degraded inputs at both 4$\times$ and 8$\times$ downsampling (zoom in for details).
  • ...and 18 more figures
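
The captions for Figures 2 and 3 refer to low-frequency content of an image at a given forward-diffusion timestep, and to denoising that is constrained by that content. The sketch below illustrates one common way such a constraint can be expressed: noise the reference image to timestep t, then swap the low-frequency band of a denoising proposal for the reference's low-frequency band. This is only an illustration under stated assumptions, not the authors' implementation. The box-filter low-pass, the `alpha_bar_t` value, the random stand-in arrays, and all function names (`low_pass`, `forward_diffuse`, `constrain_low_freq`) are hypothetical choices made for the example.

```python
import numpy as np

def low_pass(x, factor=4):
    """Keep low-frequency content: average-pool by `factor`, then upsample by repetition.
    (Assumed filter; any linear low-pass operator could be used instead.)"""
    h, w = x.shape
    pooled = x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)

def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

def constrain_low_freq(proposal, reference_t, factor=4):
    """Keep the proposal's high frequencies, take low frequencies from the reference."""
    return proposal - low_pass(proposal, factor) + low_pass(reference_t, factor)

rng = np.random.default_rng(0)
restored = rng.random((16, 16))    # stand-in for a low-quality restoration output
proposal = rng.random((16, 16))    # stand-in for a diffusion denoising proposal
reference_t = forward_diffuse(restored, alpha_bar_t=0.5, rng=rng)
constrained = constrain_low_freq(proposal, reference_t)
```

Because this low-pass filter is linear and idempotent, the constrained sample shares its low-frequency band with the noised reference, while its high-frequency detail still comes from the denoising proposal.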