Enhancing Blind Face Restoration through Online Reinforcement Learning

Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang

TL;DR

This work addresses the ill-posed nature of Blind Face Restoration (BFR) by introducing Likelihood-Regularized Policy Optimization (LRPO), the first online reinforcement learning (RL) framework for BFR. LRPO uses a policy to sample multiple high-quality (HQ) candidates per low-quality (LQ) input and learns from a composite reward, ground-truth (GT) guided likelihood regularization, and noise-aware advantages. It post-trains the base diffusion model, and the RL component is discarded at inference. Experiments show that LRPO achieves state-of-the-art results on synthetic and real-world datasets, with both objective metrics and human preferences favoring its restored faces. The approach enables principled exploration of diverse restorations while maintaining identity fidelity and perceptual realism.

Abstract

Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.
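
The abstract describes scoring a group of sampled candidates and using those rewards to refine the policy. A minimal sketch of one common way to turn group rewards into relative advantages follows. It assumes mean and standard-deviation normalization within the group, which the text above does not confirm as LRPO's exact formula. The function name `group_relative_advantages`, the `eps` constant, and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    This is a common group-relative scheme, not necessarily the paper's.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical composite-reward scores for four HQ candidates
# restored from the same LQ input.
rewards = [0.62, 0.71, 0.55, 0.80]
advantages = group_relative_advantages(rewards)

# Candidates scoring above the group mean get positive advantages,
# those below get negative ones.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Normalizing within the group means only the ranking and spread of candidates for the same input matter, so inputs that are uniformly easy or hard do not dominate the update.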

Paper Structure

This paper contains 18 sections, 23 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: (Left) Our proposed online RL-based face restoration framework: an LQ face is input to the policy network $\pi_{\theta}$, which generates a group of HQ face candidates. The reward function evaluates each candidate and converts the scores into within-group relative advantages that guide policy optimization for the next iteration. The comparisons (Right) demonstrate the quality improvement achieved through RL optimization over the base model.
  • Figure 2: The overview of our proposed LRPO framework. The policy network produces multiple HQ restoration candidates from a single LQ input, which are then assessed by the reward function and transformed into advantage scores. The framework assigns weighted advantage scores to individual denoising steps according to their contribution to restoration quality, and integrates ground-truth guided likelihood regularization into the RL optimization objective to maintain fidelity.
  • Figure 3: Qualitative results on the CelebA-Test dataset. (Zoom in for details)
  • Figure 4: Qualitative results on real-world datasets. (Zoom in for details)
  • Figure 5: Ablation study visualizations.
  • ...and 7 more figures
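
The Figure 2 caption mentions assigning weighted advantage scores to individual denoising steps. A minimal sketch of that idea follows, assuming each step has a non-negative weight and that the weights are normalized to sum to one. The weighting rule, the function name `per_step_advantages`, and the example weights are hypothetical, since the caption does not state how the weights are computed.

```python
def per_step_advantages(advantage, step_weights):
    """Distribute one candidate-level advantage across denoising steps.

    Assumption: each step receives a share proportional to its weight,
    with the weights normalized to sum to one.
    """
    if any(w < 0 for w in step_weights):
        raise ValueError("step weights must be non-negative")
    total = sum(step_weights)
    if total <= 0:
        raise ValueError("at least one step weight must be positive")
    return [advantage * w / total for w in step_weights]


# Hypothetical weights for three denoising steps, where later steps
# are assumed to contribute more to the final restoration quality.
weights = [1.0, 2.0, 3.0]
shares = per_step_advantages(1.5, weights)
print(shares)
```

With normalized weights, the per-step shares sum back to the candidate's advantage, so the weighting changes where the learning signal is applied without changing its total size.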