CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models

Maitreya Suin, Rama Chellappa

TL;DR

CLR-Face addresses blind face restoration under severe degradation by learning a highly expressive latent space within a Vector-Quantized Autoencoder and applying a conditional score-based diffusion prior to iteratively refine latent embeddings. An Identity Recovery Network (IRN) provides identity-focused guidance, gated by a learnable latent mask that preserves key identity features while maintaining perceptual quality. Across synthetic and real-world benchmarks, it delivers stronger identity preservation (IDS) and perceptual fidelity (lower LPIPS, FID) with faster inference than pixel-space diffusion baselines. The framework offers a scalable path toward high-fidelity, identity-consistent BFR and motivates future work on more efficient reverse solvers and recognition-free guidance strategies.

Abstract

Recent generative-prior-based methods have shown promising blind face restoration performance. They usually project the degraded images to the latent space and then decode high-quality faces either by single-stage latent optimization or directly from the encoding. Generating fine-grained facial details faithful to inputs remains a challenging problem. Most existing methods produce either overly smooth outputs or alter the identity as they attempt to balance between generation and reconstruction. This may be attributed to the typical trade-off between quality and resolution in the latent space. If the latent space is highly compressed, the decoded output is more robust to degradations but shows worse fidelity. On the other hand, a more flexible latent space can capture intricate facial details better, but is extremely difficult to optimize for highly degraded faces using existing techniques. To address these issues, we introduce a diffusion-based prior inside a VQGAN architecture that focuses on learning the distribution over uncorrupted latent embeddings. With such knowledge, we iteratively recover the clean embedding conditioned on its degraded counterpart. Furthermore, to ensure the reverse diffusion trajectory does not deviate from the underlying identity, we train a separate Identity Recovery Network and use its output to constrain the reverse diffusion process. Specifically, using a learnable latent mask, we add gradients from a face-recognition network to a subset of latent features that correlates with the finer identity-related details in the pixel space, leaving the other features untouched. Disentanglement between perception and fidelity in the latent space allows us to achieve the best of both worlds. We perform extensive evaluations on multiple real and synthetic datasets to validate the superiority of our approach.
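
The masked guidance idea described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the learned conditional score network and the face-recognition gradient are both replaced by simple quadratic stand-ins, the mask is fixed rather than learned, and every name here (`masked_guided_refinement`, `z_cond`, `z_id`, `guidance`) is hypothetical. It only shows how a binary latent mask can restrict an identity gradient to a subset of latent features while the remaining features follow the conditional prior alone.

```python
import numpy as np

def masked_guided_refinement(z_init, z_cond, z_id, mask,
                             n_steps=50, step=0.1, guidance=0.5,
                             noise_scale=0.0, seed=0):
    """Toy iterative latent refinement with mask-gated identity guidance.

    Assumptions (not from the paper):
      - the conditional prior is a quadratic pull toward z_cond,
      - the identity gradient is a quadratic pull toward z_id,
      - mask is a fixed 0/1 array instead of a learned latent mask.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z_init, dtype=float).copy()
    z_cond = np.asarray(z_cond, dtype=float)
    z_id = np.asarray(z_id, dtype=float)
    mask = np.asarray(mask, dtype=float)

    for _ in range(n_steps):
        # Stand-in for the learned conditional score: pull toward the
        # embedding of the degraded input.
        prior_dir = z_cond - z
        # Stand-in for the face-recognition gradient: gradient of
        # -0.5 * ||z - z_id||^2 with respect to z.
        id_grad = z_id - z
        # Identity guidance touches only the masked features.
        z = z + step * prior_dir + step * guidance * mask * id_grad
        if noise_scale > 0:
            # Optional Langevin-style noise; disabled by default.
            z = z + noise_scale * np.sqrt(step) * rng.standard_normal(z.shape)
    return z


if __name__ == "__main__":
    z0 = np.full(4, 5.0)
    refined = masked_guided_refinement(
        z0, z_cond=np.zeros(4), z_id=np.ones(4),
        mask=np.array([1, 1, 0, 0]), n_steps=200)
    print(refined)
```

In this toy setting the unmasked features converge to `z_cond`, while the masked features settle at a weighted blend of `z_cond` and `z_id` determined by `guidance`. This mirrors, in a very simplified form, the separation between perception-related and identity-related latent features that the abstract describes.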

Paper Structure (14 sections, 11 equations, 6 figures, 5 tables)

This paper contains 14 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: An overview of our inference (left) and training framework (right).
  • Figure 2: Qualitative comparisons on CelebA-Test set for BFR.
  • Figure 3: Qualitative comparisons on CelebA-Test set for $\times 32$ upsampling. Although the input is severely degraded, our approach restores the face more faithfully than existing methods.
  • Figure 4: Qualitative comparisons on real-world datasets. The first two rows show images from the WIDER Face dataset; the third row shows images from WebPhoto.
  • Figure 5: Qualitative comparisons on real-world images from CelebA-Child for image colorization.
  • ...and 1 more figure