W-Net: A Facial Feature-Guided Face Super-Resolution Network

Hao Liu, Yang Yang, Yunxia Liu

TL;DR

This paper proposes W-Net, a novel network architecture that uses a carefully designed Parsing Block to fully exploit the resolution potential of the LR image, and uses a facial parsing graph as a mask to balance the reconstructed facial images between perceptual quality and pixel accuracy.

Abstract

Face Super-Resolution (FSR) aims to recover high-resolution (HR) face images from low-resolution (LR) ones. Despite the progress made by convolutional neural networks in FSR, the results of existing approaches are not ideal because of their low reconstruction efficiency and insufficient use of prior information. Since faces are highly structured objects, effectively leveraging facial priors to improve FSR results is a worthwhile endeavor. This paper proposes a novel network architecture called W-Net to address this challenge. W-Net uses a carefully designed Parsing Block to fully exploit the resolution potential of the LR image and obtain a facial parsing map. We use this parsing map as an attention prior, effectively integrating information from both the parsing map and the LR image. At the same time, we perform multiple fusions in various dimensions through the W-shaped network structure combined with the LPF (LR-Parsing Map Fusion Module). Additionally, we use a facial parsing graph as a mask, assigning different weights and loss functions to key facial areas to balance our reconstructed facial images between perceptual quality and pixel accuracy. We conducted extensive comparative experiments, not limited to conventional facial super-resolution metrics but also extending to downstream tasks such as facial recognition and facial keypoint detection. The experiments demonstrate that W-Net achieves outstanding performance in quantitative metrics, visual quality, and downstream tasks.
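
The idea of using a parsing mask to weight the loss on key facial areas can be illustrated with a minimal sketch. This is not the paper's loss: the function name `masked_weighted_l1`, the binary mask, the weight values, and the use of a single L1 term are all assumptions made for illustration, whereas the abstract describes different weights and loss functions per region.

```python
import numpy as np

def masked_weighted_l1(sr, hr, parsing_mask, key_weight=2.0, base_weight=1.0):
    """Weighted mean absolute error (hypothetical helper, not from the paper).

    sr, hr:        float arrays of shape (H, W, C), reconstructed and ground-truth images.
    parsing_mask:  boolean array of shape (H, W); True marks key facial areas.
    key_weight:    assumed weight for pixels inside the mask.
    base_weight:   assumed weight for all other pixels.
    """
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    mask = np.asarray(parsing_mask, dtype=bool)

    # Per-pixel weights, broadcast over the channel axis: shape (H, W, 1).
    weights = np.where(mask, key_weight, base_weight)[..., None]
    abs_err = np.abs(sr - hr)

    # Normalize by the total weight so the value stays on the scale of plain L1.
    return float((weights * abs_err).sum() / (weights.sum() * sr.shape[-1]))
```

With equal weights the function reduces to the ordinary mean absolute error. Raising `key_weight` makes errors inside the masked regions count more, which is one simple way to trade off pixel accuracy across facial areas.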

Paper Structure (16 sections, 29 equations, 8 figures, 3 tables)

This paper contains 16 sections, 29 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The W-Net model utilizes low-quality images to obtain face parsing maps as attention priors, effectively performing face super-resolution through the fusion of features from multiple scales of parsing maps and low-quality images.
  • Figure 2: The Parsing Block consists of shallow convolutional layers and residual blocks to extract deep features, followed by HourGlass blocks to extract facial landmark features. After passing through attention units and convolutional layers to adjust channel numbers, the facial parsing map is obtained.
  • Figure 3: The LPF is composed of multiple convolutional layers and different attention layers. It weights the LR image and the parsing map at the pixel level and uses multiple identity connections to form the final output.
  • Figure 4: Visual comparison with state-of-the-art facial super-resolution methods. The low-resolution facial images are sized 32×32 (top two rows) and 16×16 (bottom two rows), upscaled by factors of four and eight, respectively. Zoom in for details.
  • Figure 5: Euclidean distance between the facial keypoints that OpenFace detects in the outputs of different FSR methods and the keypoints of the HR images. Zoom in for details.
  • ...and 3 more figures
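
The keypoint comparison described in the Figure 5 caption can be sketched as a mean Euclidean distance between two landmark sets. This is a minimal sketch under stated assumptions: the landmarks are taken to be already extracted (for example by OpenFace) into arrays of shape (N, 2), the function name `mean_keypoint_distance` is hypothetical, and the paper may normalize or aggregate the distances differently.

```python
import numpy as np

def mean_keypoint_distance(pred_pts, ref_pts):
    """Mean Euclidean distance between matched 2D keypoints (hypothetical helper).

    pred_pts: (N, 2) keypoints detected on a super-resolved image.
    ref_pts:  (N, 2) keypoints detected on the corresponding HR image.
    """
    pred = np.asarray(pred_pts, dtype=np.float64)
    ref = np.asarray(ref_pts, dtype=np.float64)
    if pred.shape != ref.shape or pred.ndim != 2 or pred.shape[1] != 2:
        raise ValueError("expected two arrays of identical shape (N, 2)")

    # Per-keypoint distance, then the average over all keypoints.
    return float(np.linalg.norm(pred - ref, axis=1).mean())
```

A lower value means the keypoints found on the reconstructed face lie closer to those found on the HR face.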