CATFace: Cross-Attribute-Guided Transformer with Self-Attention Distillation for Low-Quality Face Recognition

Niloufar Alipour Talemi, Hossein Kashiani, Nasser M. Nasrabadi

TL;DR

CATFace tackles face recognition under low-quality imaging by integrating soft biometric attributes through a cross-attribute-guided transformer fusion (CATF) and a self-attention distillation framework. The method uses a two-branch network (FR and SB) trained in two steps, then fuses SB and FR features with CATF to capture long-range dependencies and enhance discriminative regions. Self-attention distillation aligns high-quality and low-quality attention maps via cosine-based losses, yielding quality-invariant representations and improved robustness across diverse datasets and degraded conditions. Extensive experiments across CelebA, MS1MV2, and cross-resolution benchmarks demonstrate substantial gains in FR accuracy and SB attribute prediction, particularly in mixed-quality and low-quality regimes, validating its practical impact for unconstrained biometrics.
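The cosine-based alignment of attention maps mentioned above can be illustrated with a small sketch. This is not the authors' loss: the exact formulation is not given in this summary, so the function names (`cosine_similarity`, `attention_distillation_loss`) and the choice of averaging `1 - cos` over paired, flattened maps are assumptions made purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length flattened attention maps."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Degenerate all-zero map: treat as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def attention_distillation_loss(hq_maps, lq_maps):
    """Mean of (1 - cosine similarity) over paired attention maps.

    hq_maps: flattened attention maps from high-quality images.
    lq_maps: flattened attention maps from their low-quality counterparts.
    ASSUMPTION: this loss form is illustrative, not taken from the paper.
    """
    if len(hq_maps) != len(lq_maps) or not hq_maps:
        raise ValueError("expected the same non-zero number of maps")
    losses = [1.0 - cosine_similarity(h, l) for h, l in zip(hq_maps, lq_maps)]
    return sum(losses) / len(losses)
```

Under this sketch the loss is near zero when the low-quality attention map points in the same direction as the high-quality one, and it grows as the two maps diverge, which is the behavior a quality-invariant representation would call for.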

Abstract

Although face recognition (FR) has achieved great success in recent years, it is still challenging to accurately recognize faces in low-quality images because facial details are obscured. Nevertheless, it is often feasible to predict specific soft biometric (SB) attributes, such as gender and baldness, even in low-quality images. In this paper, we propose a novel multi-branch neural network that leverages SB attribute information to boost the performance of FR. To this end, we propose a cross-attribute-guided transformer fusion (CATF) module that effectively captures the long-range dependencies and relationships between FR and SB feature representations. The synergy created by the reciprocal flow of information in the dual cross-attention operations of the proposed CATF module enhances the performance of FR. Furthermore, we introduce a novel self-attention distillation framework that effectively highlights crucial facial regions, such as landmarks, by aligning low-quality images with their high-quality counterparts in the feature space. The proposed self-attention distillation regularizes our network to learn a unified quality-invariant feature representation in unconstrained environments. We conduct extensive experiments on various FR benchmarks of varying quality. Experimental results demonstrate the superiority of our FR method over state-of-the-art FR methods.
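The reciprocal flow of information in dual cross-attention can be sketched as follows. This is a simplified, dependency-free illustration rather than the CATF module itself: it omits learned query/key/value projections, multiple heads, and normalization. The residual addition and the names `cross_attention` and `dual_cross_attention` are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query token attends over context tokens.

    ASSUMPTION: no learned projections; keys and values are the raw context tokens.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return attended


def dual_cross_attention(fr_tokens, sb_tokens):
    """FR tokens query SB tokens and SB tokens query FR tokens.

    Each stream is returned with a residual connection (an assumption here).
    """
    fr_from_sb = cross_attention(fr_tokens, sb_tokens)
    sb_from_fr = cross_attention(sb_tokens, fr_tokens)
    fr_out = [[a + b for a, b in zip(t, u)] for t, u in zip(fr_tokens, fr_from_sb)]
    sb_out = [[a + b for a, b in zip(t, u)] for t, u in zip(sb_tokens, sb_from_fr)]
    return fr_out, sb_out
```

Both streams keep their original token count and dimensionality, so in this sketch the enriched FR tokens could be passed on to a recognition head unchanged in shape.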

Paper Structure

This paper contains 28 sections, 17 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Examples of face images with different degrees of degradation in various real-world FR benchmarks. In some images, the identity of the person is not easily recognizable due to the lack of some clues that are essential for FR. However, the gender of a person can still be inferred from those images. Therefore, leveraging some SB attributes like gender can enhance FR performance in challenging conditions. Note that M and F stand for male and female, respectively.
  • Figure 3: Multi-branch neural network with self-attention distillation for FR and SB attribute prediction. Note that MHSA stands for multi-head self-attention module. This diagram shows the first step of our two-step training process. The $Br_{FR}$ and $Br_{SB}$ branches are jointly trained in the first step of the training process. In the second step, to enrich the FR feature representations, the SB and FR feature representations are fused together through the proposed CATF module (see Fig. 4). It should be noted that the global average pooling (GAP) and the final fully connected (FC) layers are removed from each branch for the second step of the training process.
  • Figure 4: Proposed cross-attribute-guided transformer fusion (CATF) module for FR. This diagram shows the second step of our two-step training process.
  • Figure 5: Proposed channel-wise attentional fusion (CAF) block.
  • Figure 6: Images corrupted by simulated atmospheric turbulence with strengths ranging from 0.25 to 2 (the first image is the original one).
  • ...and 3 more figures