DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning

Bin Wu, Wei Wang, Yahui Liu, Zixiang Li, Yao Zhao

TL;DR

DiffusionReward introduces Reward Feedback Learning to blind face restoration by employing a Face Reward Model trained on human preferences to provide gradient feedback during diffusion denoising. The method couples a dynamic FRM with a structural consistency constraint and weight regularization to preserve identity while enhancing facial detail, mitigating reward hacking through continual FRM updates. Experiments on synthetic and real-world datasets show state-of-the-art improvements in perceptual quality and identity fidelity across diffusion-based BFR baselines. The approach offers a principled way to align restoration outputs with human preferences, with practical implications for high-fidelity, identity-preserving face restoration in the wild.

Abstract

Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. This not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source code, data, and models are available at: https://github.com/01NeuralNinja/DiffusionReward.
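
As a rough illustration of how the three signals named in the abstract might combine into one training objective, the sketch below sums a reward term, a weight-regularization term, and a structural consistency term. Everything specific here is an assumption for illustration, since the abstract does not give the formulas: the function names, the weights `lambda_reg` and `lambda_struct`, the hypothetical `reference` image, and the form of each term (negated reward, squared L2 distance to the pretrained weights, mean squared error against the reference).

```python
from typing import Sequence


def reward_loss(reward_score: float) -> float:
    """Assumed form: a higher reward score gives a lower loss (negation)."""
    return -reward_score


def weight_reg_loss(theta: Sequence[float], theta_ref: Sequence[float]) -> float:
    """Assumed form: squared L2 distance between current and pretrained weights."""
    return sum((a - b) ** 2 for a, b in zip(theta, theta_ref))


def struct_loss(restored: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed form: mean squared error between restored and reference pixels."""
    return sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(restored)


def total_loss(
    reward_score: float,
    theta: Sequence[float],
    theta_ref: Sequence[float],
    restored: Sequence[float],
    reference: Sequence[float],
    lambda_reg: float = 0.1,      # hypothetical weight, not from the paper
    lambda_struct: float = 1.0,   # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of the three terms that shape the guiding gradient."""
    return (
        reward_loss(reward_score)
        + lambda_reg * weight_reg_loss(theta, theta_ref)
        + lambda_struct * struct_loss(restored, reference)
    )


# Toy values: flat lists stand in for weight tensors and images.
loss = total_loss(
    reward_score=0.8,
    theta=[1.0, 2.0],
    theta_ref=[1.0, 1.0],
    restored=[0.5, 0.5],
    reference=[0.5, 1.0],
)
```

In this toy setup the regularization term grows as the weights drift from their pretrained values, and the structural term grows as the restored image departs from the reference, so both act as counterweights to maximizing the reward alone.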

Paper Structure

This paper contains 28 sections, 9 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: An example of issues with diffusion-based face restoration methods. After enhancement with ReFL, the issues in the base model are significantly mitigated.
  • Figure 2: Training framework of the Face Reward Model. We first train an SVM classifier (Cortes and Vapnik, 1995) for automated annotation. The classifier is trained with the metric vectors ($\pmb{v}_1$, $\pmb{v}_2$) and annotated supervision signals (Left). The face reward model is based on the CLIP architecture (Radford et al., 2021) (Right), where the last 20 layers of the image encoder $E_I$ and the last 11 layers of the text encoder $E_t$ are trainable, while the remaining parameters are frozen. $s_1$ and $s_2$ represent the scores, derived from the similarity between the image embedding and the text embedding (e.g., $\langle \pmb{e}_{i_1}, \pmb{e}_t \rangle$).
  • Figure 2: Performance comparison of face restoration methods on wild datasets. The highest score for each metric is highlighted in red, and the second-highest in blue. Metrics with $\uparrow$ indicate higher is better. The values in parentheses represent our method's improvements over base models.
  • Figure 3: Our ReFL training framework. (Left) We introduce multiple constraints to optimize the generation module $g_\theta$, including $\mathcal{L}_\text{reward}$, $\mathcal{L}_\text{reg}$, and $\mathcal{L}_\text{struct}$ (see details in Section \ref{subsec:refl}). (Right) For training efficiency, these constraints are applied only at the last denoising step.
  • Figure 3: Performance Comparison of FRM and HPS v2 Reward Models
  • ...and 10 more figures
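
The scoring described in the Figure 2 caption (scores $s_1$ and $s_2$ derived from image-text embedding similarity) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the caption only says the scores come from the similarity between the image embedding and the text embedding, so the use of cosine similarity, the absence of a temperature scale, and the pairwise softmax cross-entropy `preference_loss` are illustrative choices and not the paper's confirmed formulation. Short lists stand in for real CLIP embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def preference_loss(s_preferred: float, s_rejected: float) -> float:
    """Assumed pairwise loss: -log softmax of the preferred score over both.

    Equals log(1 + exp(s_rejected - s_preferred)).
    """
    return math.log1p(math.exp(s_rejected - s_preferred))


# Toy embeddings (hypothetical values, not real encoder outputs).
e_t = [1.0, 0.0]    # text embedding
e_i1 = [2.0, 0.0]   # embedding of the image annotated as preferred
e_i2 = [0.0, 3.0]   # embedding of the other image

s1 = cosine_similarity(e_i1, e_t)
s2 = cosine_similarity(e_i2, e_t)
pair_loss = preference_loss(s1, s2)
```

With this form, the loss shrinks as the preferred image's score rises above the other image's score, and equals log 2 when the two scores are tied.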