OSDFace: One-Step Diffusion Model for Face Restoration

Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, Xiaokang Yang

TL;DR

OSDFace introduces a one-step diffusion framework for face restoration that achieves high fidelity with fast inference. It combines a Visual Representation Embedder (VRE) to extract priors from low-quality faces and a single denoising step guided by a learnable prompt, enabling efficient HQ reconstruction. An ArcFace-based facial identity loss and GAN guidance further ensure identity preservation and distribution alignment with ground truth. Across synthetic and real-world datasets, OSDFace attains state-of-the-art perceptual and fidelity metrics while reducing computation, illustrating the practical viability of priors-informed one-step diffusion for faces.
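The identity loss mentioned above compares face-recognition embeddings of the restored and ground-truth images. A minimal sketch of one common form, assuming a cosine-distance formulation (this summary does not give the exact formula), is shown below. The function name `identity_loss` is hypothetical, and the embeddings are plain arrays standing in for the output of a recognition network such as ArcFace.

```python
import numpy as np


def identity_loss(emb_restored: np.ndarray, emb_target: np.ndarray,
                  eps: float = 1e-8) -> float:
    """Mean (1 - cosine similarity) over a batch of embeddings of shape (N, D).

    Assumption: the embeddings come from a frozen face-recognition network;
    here they are just arrays supplied by the caller.
    """
    # L2-normalize each embedding so the dot product equals cosine similarity.
    a = emb_restored / (np.linalg.norm(emb_restored, axis=1, keepdims=True) + eps)
    b = emb_target / (np.linalg.norm(emb_target, axis=1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=1)
    # 0 when the embeddings point the same way, 2 when they are opposite.
    return float(np.mean(1.0 - cos))
```

Under this formulation the loss depends only on the direction of the embeddings, not their magnitude, so rescaling either embedding leaves the value unchanged.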

Abstract

Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject's identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released at https://github.com/jkwang28/OSDFace.
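The abstract describes embedding tokenized low-quality faces with a vector-quantized dictionary. The core lookup in any vector-quantized dictionary is a nearest-neighbor search over codebook entries, which can be sketched as follows. This is an illustration of the general technique only: the function name `quantize`, the array shapes, and the toy codebook are assumptions, and how OSDFace builds its tokenizer or turns the selected codes into a visual prompt is not specified here.

```python
import numpy as np


def quantize(tokens: np.ndarray, codebook: np.ndarray):
    """Map each token to its nearest codebook entry.

    tokens:   (N, D) feature vectors, e.g. from an encoder (assumed shape).
    codebook: (K, D) dictionary entries.
    Returns (indices of shape (N,), quantized tokens of shape (N, D)).
    """
    # Squared Euclidean distances via ||t||^2 - 2 t.c + ||c||^2, shape (N, K).
    d2 = (
        np.sum(tokens ** 2, axis=1, keepdims=True)
        - 2.0 * tokens @ codebook.T
        + np.sum(codebook ** 2, axis=1)[None, :]
    )
    idx = np.argmin(d2, axis=1)
    return idx, codebook[idx]


# Toy usage with a hypothetical 3-entry, 2-dimensional codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tokens = np.array([[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]])
indices, quantized = quantize(tokens, codebook)
```

Each token is replaced by the dictionary entry closest to it, so degraded inputs are snapped onto a fixed set of learned representations.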

Paper Structure

This paper contains 15 sections, 16 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Visual samples of diffusion-based face restoration methods. We report multiply-accumulate operations (MACs), inference time, and the number of timesteps used during inference. OSDFace produces a more natural and faithful visual result than the other methods.
  • Figure 2: Performance comparison on CelebA-Test. Metrics for which smaller scores indicate better image quality are inverted and normalized for display. OSDFace achieves leading scores on most metrics with only one diffusion step.
  • Figure 3: Training framework of OSDFace. First, to establish a visual representation embedder (VRE), we train the autoencoder and VQ dictionary for the HQ and LQ face domains using self-reconstruction and the feature association loss $\mathcal{L}_{\text{assoc}}$. Then, we use the VRE, which contains the LQ encoder and dictionary, to embed the LQ face $I_L$, producing the visual prompt embedding $p_L$. Next, the LQ image $I_L$ and $p_L$ are fed into the generator $\mathcal{G}_\theta$ to yield the predicted HQ face $\hat{I}_H$: $\hat{I}_H=\mathcal{G}_\theta(I_L; \operatorname{VRE}(I_L))$. The generator $\mathcal{G}_\theta$ incorporates the pretrained VAE and UNet from Stable Diffusion, with only the UNet fine-tuned via LoRA. Additionally, a series of feature alignment losses is applied to ensure the generation of harmonious and coherent face images. The generator and discriminator are trained alternately.
  • Figure 4: Attention maps of VRE and visual comparison of prompt embedding generation. We use CelebA-Test 0262 as an example.
  • Figure 5: Visual comparison on the synthetic CelebA-Test dataset in challenging cases. Please zoom in for a better view.
  • ...and 3 more figures
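
The Figure 3 caption notes that only the UNet is fine-tuned via LoRA. As a rough illustration of what LoRA does to a single linear layer, the sketch below keeps the pretrained weight frozen and adds a trainable low-rank update. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization are assumptions for illustration; the caption does not say which layers or hyperparameters OSDFace uses.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T with a low-rank update: W_eff = W + (alpha / r) * B A."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (N, in_dim). Frozen path plus the scaled low-rank path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


# Toy usage: a hypothetical 4 -> 3 layer with rank-2 adapters.
W = np.arange(12, dtype=float).reshape(3, 4)
layer = LoRALinear(W, rank=2, alpha=2.0)
x = np.ones((5, 4))
y0 = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and training only has to update the small matrices `A` and `B` rather than the full weight.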