Enhancing Blind Face Restoration through Online Reinforcement Learning
Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang
TL;DR
This work addresses the ill-posed nature of Blind Face Restoration (BFR) by introducing Likelihood-Regularized Policy Optimization (LRPO), the first online reinforcement learning (RL) framework for BFR. LRPO uses a policy to sample multiple high-quality (HQ) candidates per low-quality input and learns from a composite reward, ground-truth (GT) guided likelihood regularization, and noise-aware advantages. It post-trains the base diffusion model, and the RL component is discarded at inference. Experiments show that LRPO achieves state-of-the-art results on synthetic and real-world datasets, with both objective metrics and human preferences favoring its restored faces. The approach advances BFR by enabling principled exploration of diverse restorations while maintaining identity fidelity and perceptual realism.
Abstract
Blind Face Restoration (BFR) faces an inherent challenge in exploring its large solution space, which leads to common artifacts such as missing details and identity ambiguity in the restored images. To tackle this challenge, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs and improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restorations that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that the proposed LRPO significantly improves face restoration quality over baseline methods and achieves state-of-the-art performance.
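As a rough illustration of how rewards from several sampled candidates might be turned into a learning signal, the following minimal sketch scores each candidate with a weighted composite reward and normalizes the rewards within the group. Everything specific here is an assumption for illustration only: the reward terms (`perceptual`, `identity`), their weights, and the group-wise mean and standard deviation normalization are not taken from the abstract, and the sketch does not model the likelihood regularization or the noise-level advantage assignment.

```python
from statistics import mean, pstdev

# Hypothetical reward terms and weights; the abstract does not specify them.
WEIGHTS = {"perceptual": 0.5, "identity": 0.5}


def composite_reward(scores, weights=WEIGHTS):
    """Weighted sum of per-candidate quality scores (assumed form)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates.

    Assumption: advantages are the rewards centered on the group mean
    and scaled by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for three candidates sampled from one low-quality input.
candidates = [
    {"perceptual": 0.9, "identity": 0.5},
    {"perceptual": 0.6, "identity": 0.8},
    {"perceptual": 0.3, "identity": 0.4},
]
rewards = [composite_reward(c) for c in candidates]
advantages = group_advantages(rewards)
```

Candidates with above-average reward receive positive advantages and the rest receive negative ones, so a policy update weighted by these values would favor the better restorations within each group.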
