Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration

Wanglong Lu; Jikai Wang; Tao Wang; Kaihao Zhang; Xianta Jiang; Hanli Zhao

TL;DR

This work addresses blind face restoration with a diffusion-based visual style prompt learning framework that operates in the latent space $\mathcal{W}^+$ of a pre-trained StyleGAN. A diffusion-based style prompt module generates high-quality latent codes $\boldsymbol{w}^0$, which serve as visual prompts. Inside a restoration auto-encoder, these prompts are combined with a StyleGAN facial feature bank and a style-modulated aggregation transformation (SMART) layer to produce the restored image $\mathbf{I}_{out}$. Training proceeds in two stages: the style encoder and code diffuser are first learned jointly with diffusion, LPIPS, and identity losses, and the restoration network is then trained with adversarial objectives. The method achieves superior perceptual quality on synthetic and real-world data and benefits downstream tasks such as landmark detection and emotion recognition. The approach offers a practical, interpretable way to leverage generative priors for restoration, with potential extensions to textual prompts and video-based applications.

Abstract

Blind face restoration aims to recover high-quality facial images from various unidentified sources of degradation, posing significant challenges due to the minimal information retrievable from the degraded images. Prior knowledge-based methods, leveraging geometric priors and facial features, have led to advancements in face restoration but often fall short of capturing fine details. To address this, we introduce a visual style prompt learning framework that utilizes diffusion probabilistic models to explicitly generate visual prompts within the latent space of pre-trained generative models. These prompts are designed to guide the restoration process. To fully utilize the visual prompts and enhance the extraction of informative and rich patterns, we introduce a style-modulated aggregation transformation layer. Extensive experiments and applications demonstrate the superiority of our method in achieving high-quality blind face restoration. The source code is available at https://github.com/LonglongaaaGo/VSPBFR.
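
To make the idea of generating visual prompts by denoising in a latent space more concrete, the following is a minimal sketch of a reverse diffusion loop over a latent code. It assumes a standard DDPM noise-prediction parameterization with a linear noise schedule, which this summary does not confirm as the authors' exact formulation. The names `linear_betas`, `denoise_codes`, `make_oracle`, and `cond` are hypothetical, and the toy oracle stands in for a trained code diffuser conditioned on the degraded image.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used default)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def denoise_codes(w_T, cond, predict_noise, num_steps, rng):
    """DDPM-style ancestral sampling from noise code w^T down to w^0.

    predict_noise(w_t, t, cond) is a placeholder for a learned network;
    cond would carry features of the degraded image (hypothetical here).
    """
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    w = list(w_T)
    for t in range(num_steps - 1, -1, -1):
        eps = predict_noise(w, t, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(wi - coef * ei) / math.sqrt(alphas[t])
                for wi, ei in zip(w, eps)]
        if t > 0:
            # Add fresh Gaussian noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            w = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            w = mean
    return w


def make_oracle(w_clean, num_steps):
    """Toy stand-in for a trained diffuser: it already knows the clean code."""
    alpha_bars, running = [], 1.0
    for beta in linear_betas(num_steps):
        running *= 1.0 - beta
        alpha_bars.append(running)

    def predict_noise(w_t, t, cond):
        scale = math.sqrt(alpha_bars[t])
        noise_scale = math.sqrt(1.0 - alpha_bars[t])
        return [(wt - scale * w0) / noise_scale
                for wt, w0 in zip(w_t, w_clean)]

    return predict_noise


rng = random.Random(0)
w_clean = [0.5, -1.0, 2.0, 0.0]            # tiny toy code, not a real W+ code
w_T = [rng.gauss(0.0, 1.0) for _ in w_clean]
w_0 = denoise_codes(w_T, cond=None,
                    predict_noise=make_oracle(w_clean, 50),
                    num_steps=50, rng=rng)
```

With the oracle predictor, the final step recovers `w_clean` up to floating-point error, which makes the loop easy to sanity-check. In a real system the predictor would be a learned network and the code would be a full $\mathcal{W}^+$ tensor handled by a tensor library rather than a Python list.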

Paper Structure (17 sections, 9 equations, 11 figures, 5 tables)

Introduction
Related work
Blind face restoration
Generative image synthesis
Method
Overview
Diffusion-based style prompt module
Restoration auto-encoder
Module training
Experimental results and comparisons
Settings
Comparison with SOTA methods
Ablation study
Analysis of style prompt learning
Analysis of SMART layer
...and 2 more sections

Figures (11)

Figure 1: The overall pipeline of our framework: the degraded image is processed through a diffusion-based style prompt module (a) to get denoised codes $\boldsymbol{w}^0$ through $T$ diffusion steps, beginning from noise codes $\boldsymbol{w}^T$. Then, the restoration auto-encoder (c) processes the degraded image, using the denoised codes $\boldsymbol{w}^0$, random codes $\hat{\boldsymbol{z}}$, and a global code $\boldsymbol{c}$ as style prompts. The network also leverages prior features from the facial feature bank (b), integrating them through a fusion process $f(\cdot)$, to achieve the restored image.
Figure 2: The detailed diffusion ($\leftarrow$) and denoising ($\rightarrow$) processes in the style latent space. We also show the corresponding inverted images of latent codes in steps.
Figure 3: Illustration of the style-modulated aggregation transformation (SMART).
Figure 4: Visual comparisons of our method and the SOTA facial restoration methods.
Figure 5: Visual comparisons of our method and the SOTA facial restoration methods.
...and 6 more figures
