ReF-LDM: A Latent Diffusion Model for Reference-based Face Image Restoration

Chi-Wei Hsiao, Yu-Lun Liu, Cheng-Kun Yang, Sheng-Po Kuo, Kevin Jou, Chia-Ping Chen

TL;DR

This paper tackles the challenge of preserving a subject's identity when restoring a face from a degraded low-quality (LQ) image by leveraging multiple high-quality (HQ) reference images. It introduces ReF-LDM, an LDM-based framework that uses a CacheKV mechanism to efficiently fuse references into the denoising process and a timestep-scaled identity loss to supervise identity features without degrading image quality. The authors also present FFHQ-Ref, a dataset of 20,405 HQ face images built from FFHQ with identity-consistent references for training and evaluation. Empirical results show that ReF-LDM achieves superior identity preservation and competitive perceptual quality against state-of-the-art methods, while offering flexible reference utilization and faster inference than alternative designs.

Abstract

While recent works on blind face image restoration have successfully produced impressive high-quality (HQ) images with abundant details from low-quality (LQ) input images, the generated content may not accurately reflect the real appearance of a person. To address this problem, incorporating well-shot personal images as additional reference inputs could be a promising strategy. Inspired by the recent success of the Latent Diffusion Model (LDM), we propose ReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned on one LQ image and multiple HQ reference images. Our model integrates an effective and efficient mechanism, CacheKV, to leverage the reference images during the generation process. Additionally, we design a timestep-scaled identity loss, enabling our LDM-based model to focus on learning the discriminating features of human faces. Lastly, we construct FFHQ-Ref, a dataset consisting of 20,405 high-quality (HQ) face images with corresponding reference images, which can serve as both training and evaluation data for reference-based face restoration models.

Paper Structure

This paper contains 44 sections, 6 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Reference-based face image restoration. Given an input low-quality face image (a), a Latent Diffusion Model (LDM) can reconstruct a high-quality image (b); however, it may not be faithful to the individual's facial identity. To address this problem, we propose ReF-LDM, which restores a high-quality image with faithful details (c) by utilizing additional reference images (d).
  • Figure 2: The proposed ReF-LDM pipeline. Our model accepts a low-quality image and multiple high-quality reference images as input and generates a high-quality image. The blue top panel alone represents a typical LDM (Rombach et al., 2022) denoising process. For an LQ image $\textbf{x}_{\text{LQ}}$, we concatenate its latent $\textbf{z}_{\text{LQ}}$ with $\textbf{z}_t$ along the channel axis to serve as the input for the denoising U-net. For the reference images $\{\textbf{x}_{\text{ref}}\}$, we design a CacheKV mechanism, depicted in the red panel, that extracts and caches their key and value tokens with the same denoising U-net only once. These cached KV tokens can then be utilized repeatedly in each of the $T$ timesteps of the main denoising process. During training, we adopt the classic LDM loss ($\mathcal{L}_\mathrm{LDM}$) and introduce a timestep-scaled identity loss ($\mathcal{L}_\mathrm{time\,ID}$).
  • Figure 3: Different mechanisms for incorporating reference images into the main denoising process.
  • Figure 3: Ablation results for the timestep-scaled identity loss.
  • Figure 4: Visual ablation results for the timestep-scaled identity loss.
  • ...and 12 more figures
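
The CacheKV idea described in the Figure 2 caption (extract key and value tokens from the references once, then reuse them at every timestep) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. `ToyAttentionLayer`, `cache_reference_kv`, `run_toy_denoising`, the random weights, and the toy loop are all hypothetical names and stand-ins. The sketch also assumes that the cached tokens are appended to the layer's own keys and values along the token axis, which the caption does not spell out.

```python
import math
import random


def project(tokens, weight):
    """Multiply each token (a row vector) by a square weight matrix."""
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok)))
         for j in range(len(weight[0]))]
        for tok in tokens
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of token vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


class ToyAttentionLayer:
    """Hypothetical stand-in for one attention layer of the denoising U-net."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def make_weight():
            return [[rng.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)]
                    for _ in range(dim)]

        self.w_q, self.w_k, self.w_v = make_weight(), make_weight(), make_weight()
        self.ref_projections = 0  # how many times reference tokens were projected

    def cache_reference_kv(self, ref_tokens):
        """Project reference tokens to keys and values once, for later reuse."""
        self.ref_projections += 1
        return project(ref_tokens, self.w_k), project(ref_tokens, self.w_v)

    def forward(self, tokens, cached_kv=None):
        """Attend over the layer's own tokens plus any cached reference tokens."""
        q = project(tokens, self.w_q)
        k = project(tokens, self.w_k)
        v = project(tokens, self.w_v)
        if cached_kv is not None:
            ref_k, ref_v = cached_kv
            # Assumption: cached tokens are appended along the token axis.
            k, v = k + ref_k, v + ref_v
        return attention(q, k, v)


def run_toy_denoising(layer, z_tokens, ref_tokens, num_steps):
    """Reuse one cached K/V pair across every step of a placeholder loop."""
    cached_kv = layer.cache_reference_kv(ref_tokens)  # computed a single time
    for _ in range(num_steps):
        z_tokens = layer.forward(z_tokens, cached_kv)  # stands in for one timestep
    return z_tokens
```

In this sketch the reference tokens are projected once regardless of `num_steps`, which is the property the caption attributes to CacheKV. A design that re-encoded the references inside the loop would instead pay that cost at each of the $T$ timesteps.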