InterLCM: Low-Quality Images as Intermediate States of Latent Consistency Models for Effective Blind Face Restoration

Senmao Li, Kai Wang, Joost van de Weijer, Fahad Shahbaz Khan, Chun-Le Guo, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng

TL;DR

This work tackles blind face restoration under unknown degradations by addressing diffusion priors' weak semantic coherence and slow sampling. It introduces InterLCM, a framework that grounds restoration in latent consistency models and treats the low-quality input as an intermediate LCM state, enabling a few-step, semantically stable reconstruction. A Visual Module and a Spatial Encoder inject face-specific semantics and structural priors, and the training combines reconstruction, perceptual, and adversarial losses to improve fidelity. Across synthetic and real-world datasets, InterLCM achieves superior restoration quality with faster inference than traditional diffusion-based methods, demonstrating practical impact for real-world BFR tasks.

Abstract

Diffusion priors have been used for blind face restoration (BFR) by fine-tuning diffusion models (DMs) on restoration datasets to recover low-quality images. However, the naive application of DMs presents two key limitations: (i) the diffusion prior has inferior semantic consistency (e.g., identity, structure, and color), which makes the BFR model harder to optimize; and (ii) it relies on hundreds of denoising iterations, which prevents effective use of perceptual losses that are crucial for faithful restoration. Observing that the latent consistency model (LCM) learns consistent noise-to-data mappings on the ODE trajectory and therefore shows more semantic consistency in subject identity, structural information, and color preservation, we propose InterLCM, which leverages the LCM's superior semantic consistency and efficiency to counter the above issues. By treating low-quality images as an intermediate state of the LCM, InterLCM achieves a balance between fidelity and quality by starting from earlier LCM steps. The LCM also allows perceptual loss to be integrated during training, leading to improved restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, InterLCM incorporates a Visual Module to extract visual features and a Spatial Encoder to capture spatial details, enhancing the fidelity of restored images. Extensive experiments demonstrate that InterLCM outperforms existing approaches on both synthetic and real-world datasets while also achieving faster inference.
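
As a rough illustration of what "treating the low-quality image as an intermediate state" could mean, the following assumes the standard variance-preserving forward process. The symbols $z_{\text{LQ}}$ (latent of the LQ image), $\bar\alpha_{t_k}$ (cumulative noise schedule at the chosen step $t_k$), $f_\theta$ (the consistency network), and $c$ (conditioning) are notation introduced here for the sketch and are not taken from the paper.

$$
z_{t_k} = \sqrt{\bar\alpha_{t_k}}\, z_{\text{LQ}} + \sqrt{1-\bar\alpha_{t_k}}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \qquad \hat{z}_0 = f_\theta(z_{t_k}, t_k, c)
$$

Under this reading, the LQ latent takes the place of the clean estimate from a previous sampling step: it is noised to the level of step $t_k$ and then mapped back to a clean estimate, so only the remaining steps need to be run.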

Paper Structure

This paper contains 34 sections, 7 equations, 28 figures, 10 tables, 1 algorithm.

Figures (28)

  • Figure 1: (Left) The intermediate states in 4-step LCM and SD Turbo models. The network used in LCM maps to the real image space, while SD Turbo progressively denoises the noisy image. (Right) Given the prompt "A headshot of a man with hat and glasses", we generate 1000 images with both LCM and SD Turbo models. Then we use DreamSim, SSIM, and color histogram distance (HDist) to measure the semantic consistency in the subject identity, spatial structure and color preservation.
  • Figure 2: (Left) The 4-step LCM maps to the origin image at each sampling step: Noise$\xrightarrow[]{\text{1st step}}$Sampling data$\xrightarrow[]{\text{add noise}}$Noisy data$\xrightarrow[]{\text{2nd step}}$Sampling data$\xrightarrow[]{\text{add noise}}$Noisy data$\xrightarrow[]{\text{3rd step}}$Sampling data$\xrightarrow[]{\text{add noise}}$Noisy data$\xrightarrow[]{\text{4th step}}$Sampling data. In the first step, the origin image is predicted from random noise. In each remaining step, noise is added to the origin image produced in the previous step before the next prediction. (Right) The first row shows the predicted origin image at each step; the second row shows the random noise and the noisy data fed to the first through third steps. For example, given the prompt "blond woman with red glasses and a black shirt", the generated images in the first row show semantic consistency in subject identity, structural information, and color constancy across steps.
  • Figure 3: Overview of the proposed InterLCM framework. The Visual Module takes LQ images and outputs visual embeddings. A Spatial Encoder provides structural information. We treat the LQ image as an intermediate state of the LCM. Through a standard LCM conditioned on both the visual embeddings and the spatial features, the LQ input can be reconstructed as an HQ image.
  • Figure 4: t-SNE visualizations of feature distributions comparing the LCM first-step samples with the LQ images (FID=103.70), and comparing their noisy intermediate states after the LCM 2nd-step noise diffusion (FID=2.83).
  • Figure 5: Naive LCM alters the original semantics of the LQ image (e.g., hair).
  • ...and 23 more figures
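
The predict-then-renoise chain described in the Figure 2 caption can be sketched as a short loop. This is a minimal toy sketch, not the authors' implementation: `toy_predict` stands in for the consistency network, the `ALPHA_BARS` values are hypothetical noise levels chosen only for illustration, and the `start_step` argument is an assumed way to express "start from an intermediate state" (for instance, with an LQ image in place of an earlier step's output).

```python
import numpy as np

# Hypothetical cumulative noise levels for the inputs of steps 1..4
# (smaller value = noisier input). Not taken from the paper.
ALPHA_BARS = (0.005, 0.05, 0.3, 0.7)


def add_noise(x0, alpha_bar, rng):
    """Diffuse a clean estimate x0 to the noise level alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


def toy_predict(x_noisy, alpha_bar):
    """Placeholder for the consistency network: maps noisy data to a
    'clean' estimate. Here it only clips values to [-1, 1]."""
    return np.clip(x_noisy, -1.0, 1.0)


def multistep_sample(predict_x0, x_start, alpha_bars, start_step=0, rng=None):
    """Alternate 'predict clean sample' and 'add noise'.

    start_step == 0: x_start is pure noise and all steps run.
    start_step == k > 0: x_start is treated as the clean estimate that
    step k would have received from the previous step; it is noised to
    alpha_bars[k] and only the remaining steps run.
    Returns the list of clean estimates, one per executed step.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    if start_step == 0:
        x_noisy = x_start
    else:
        x_noisy = add_noise(x_start, alpha_bars[start_step], rng)

    estimates = []
    for i in range(start_step, len(alpha_bars)):
        x0_hat = predict_x0(x_noisy, alpha_bars[i])
        estimates.append(x0_hat)
        if i + 1 < len(alpha_bars):
            # Re-noise the estimate to the level expected by the next step.
            x_noisy = add_noise(x0_hat, alpha_bars[i + 1], rng)
    return estimates


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    full = multistep_sample(toy_predict, rng.standard_normal((4, 4)), ALPHA_BARS)
    lq = np.full((4, 4), 0.5)  # stand-in for a low-quality latent
    partial = multistep_sample(toy_predict, lq, ALPHA_BARS, start_step=1)
    print(len(full), len(partial))
```

Starting at a later `start_step` keeps more of the input (the injected noise is weaker) but leaves fewer steps to correct it, which is one way to picture the fidelity versus quality balance mentioned in the abstract.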