DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer

Wei-Ting Chen, Gurunandan Krishnan, Qiang Gao, Sy-Yen Kuo, Sizhuo Ma, Jian Wang

TL;DR

This work tackles the challenge of robust generic face image quality assessment (GFIQA) by introducing a transformer-based framework that decouples content from degradation through Self-Supervised Dual-Set Degradation Representation Learning (DSL) and enhances perceptual sensitivity via a landmark-guided transformer. DSL learns global degradation representations by contrasting synthetic degradations on high-quality faces with real-world degradations, using a soft proximity mapping and a bidirectional contrastive loss to align cross-set degradations. A landmark-detection module and positional encoding focus the model on salient facial regions, improving regional confidence and overall MOS prediction. The authors also present CGFIQA-40k, a large, balanced dataset designed to reduce gender and skin-tone biases. Empirical results across GFIQA-20k, PIQ23, and CGFIQA-40k show that DSL-FIQA achieves superior correlation measures ($PLCC$ and $SRCC$) compared with strong baselines, underscoring the method’s robustness and practical value for real-world face image quality assessment.

Abstract

Generic Face Image Quality Assessment (GFIQA) evaluates the perceptual quality of facial images, which is crucial in improving image restoration algorithms and selecting high-quality face images for downstream tasks. We present a novel transformer-based method for GFIQA, which is aided by two unique mechanisms. First, a Dual-Set Degradation Representation Learning (DSL) mechanism uses facial images with both synthetic and real degradations to decouple degradation from content, ensuring generalizability to real-world scenarios. This self-supervised method learns degradation features on a global scale, providing a robust alternative to conventional methods that use local patch information in degradation learning. Second, our transformer leverages facial landmarks to emphasize visually salient parts of a face image in evaluating its perceptual quality. We also introduce a balanced and diverse Comprehensive Generic Face IQA (CGFIQA-40k) dataset of 40K images carefully designed to overcome the biases, in particular the imbalances in skin tone and gender representation, in existing datasets. Extensive analysis and evaluation demonstrate the robustness of our method, marking a significant improvement over prior methods.
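
As a rough illustration of the two correlation measures used in the evaluation, the sketch below computes PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation) between predicted scores and ground-truth MOS values. It is a minimal, standard-library-only sketch rather than the authors' evaluation code; the function names and the toy data are assumptions made for illustration, and it omits any nonlinear score mapping that some IQA evaluations fit before computing PLCC.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up for illustration): predictions are monotonic in MOS
# but not linear, so SRCC is perfect while PLCC is slightly lower.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
mos = [1.0, 4.0, 9.0, 16.0, 25.0]
plcc = pearson(predicted, mos)
srcc = spearman(predicted, mos)
```

SRCC depends only on the ordering of the scores, while PLCC also rewards a linear relationship, which is why quality-assessment results are commonly reported with both.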

Paper Structure

This paper contains 29 sections, 9 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: PLCC vs. SRCC comparison on the CGFIQA-40k and GFIQA-20k [su2023going] datasets. DSL-FIQA, denoted by red triangular points, outperforms other methods (ReIQA [saha2023re], StyleGAN-IQA [su2023going], MANIQA [yang2022maniqa], TRIQ [tu2021rapique]) and can provide a superior image quality assessment of facial images.
  • Figure 2: Overview of our proposed model. The model contains a core GFIQA network, a degradation extraction network, and a landmark detection network. In our approach, face images are cropped into several patches to fit the input size requirements of the pre-trained ViT feature extractor (see the model overview section). Each patch is then processed individually, and their Mean Opinion Scores (MOS) are averaged to determine the final quality score. For clarity in the figure, the segmentation of the image into patches is not shown.
  • Figure 3: Dual-Set Degradation Representation Learning (DSL) Illustrated. On the left, the process of contrastive optimization is depicted, utilizing two unique image sets. Degradation representations are extracted, followed by soft proximity mapping (SPM) calculations and contrastive optimization, compelling the degradation encoder to focus on learning specific degradation features. The right side emphasizes the bidirectional characteristic of our approach, highlighting the comprehensive strategy for identifying and understanding image degradations through contrastive learning.
  • Figure 4: t-SNE visualization of degradation representation extracted using patch-based and DSL-based methods. Unlike the patch-based method, DSL results in well-demarcated clusters for various types of degradation, thereby proving the effectiveness of the learned representations.
  • Figure 5: Comparison of using landmark mechanism to guide the GFIQA network. We present the regional confidence maps and the corresponding input. With landmark guidance, the confidence maps focus more on key facial landmarks, providing a more discriminative assessment. In contrast, without landmark guidance, the confidence maps tend to cover the entire face, often lacking specificity and even assigning higher confidence to irrelevant areas (e.g., background).
  • ...and 5 more figures
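
The Figure 2 caption describes cropping each face image into patches that fit the ViT input size, scoring every patch, and averaging the per-patch MOS. The sketch below illustrates that aggregation step only. The cropping strategy is not specified in the caption, so the regular grid with a border-aligned last crop, the 224-pixel patch size, and the `score_patch` callback (a stand-in for the quality network) are all assumptions made for illustration.

```python
from statistics import fmean


def crop_boxes(width, height, patch, stride):
    """Return (left, top, right, bottom) boxes of size patch x patch covering the image.

    Assumed strategy: a regular grid, plus one extra crop flush with the right
    and bottom borders when the grid does not reach them exactly.
    """
    if width < patch or height < patch:
        raise ValueError("image is smaller than the patch size")

    def starts(extent):
        positions = list(range(0, extent - patch + 1, stride))
        if positions[-1] != extent - patch:
            positions.append(extent - patch)  # final crop aligned to the border
        return positions

    return [
        (x, y, x + patch, y + patch)
        for y in starts(height)
        for x in starts(width)
    ]


def image_mos(width, height, score_patch, patch=224, stride=224):
    """Average the per-patch scores to obtain one image-level quality score.

    score_patch is a hypothetical callable that maps a crop box to a predicted
    MOS; in a real pipeline it would crop the pixels and run the model.
    """
    boxes = crop_boxes(width, height, patch, stride)
    return fmean([score_patch(box) for box in boxes])


# Usage with a dummy scorer that returns the same score for every patch.
boxes = crop_boxes(512, 512, patch=224, stride=224)
final_score = image_mos(512, 512, score_patch=lambda box: 0.5)
```

With a 512 x 512 input and these assumed settings, the grid yields nine overlapping crops, and the image-level score is simply the mean of the nine patch scores.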