Cross-Spectral Attention for Unsupervised RGB-IR Face Verification and Person Re-identification

Kshitij Nikhal, Cedric Nimpa Fondje, Benjamin S. Riggan

TL;DR

This work tackles unsupervised cross-spectral matching between RGB and IR imagery for face verification and person re-identification (ReID). It introduces a three-part framework combining a Cross-Spectral Attention Network (CSAN), a Pseudo Triplet Loss with Offline Cross-Spectral Voting (PTL), and Pixel-Channel Sparsity (PCS) to learn domain-invariant, discriminative representations without labeled data. By leveraging intra-domain agglomerative clustering, cross-spectral voting, and sparsity regularization, the method achieves competitive results on RegDB and ARL-VTF and, in some cases, surpasses fully supervised methods despite using no labels. The approach offers a scalable pathway for cross-spectral biometric tasks and may generalize to related unsupervised cross-domain recognition problems.

Abstract

Cross-spectral biometrics, such as matching imagery of faces or persons from visible (RGB) and infrared (IR) bands, have rapidly advanced over the last decade due to increasing sensitivity, size, quality, and ubiquity of IR focal plane arrays and enhanced analytics beyond the visible spectrum. Current techniques for mitigating large spectral disparities between RGB and IR imagery often include learning a discriminative common subspace by exploiting precisely curated data acquired from multiple spectra. Although there are challenges with determining robust architectures for extracting common information, a critical limitation for supervised methods is poor scalability in terms of acquiring labeled data. Therefore, we propose a novel unsupervised cross-spectral framework that combines (1) a new pseudo triplet loss with cross-spectral voting, (2) a new cross-spectral attention network leveraging multiple subspaces, and (3) structured sparsity to perform more discriminative cross-spectral clustering. We extensively compare our proposed RGB-IR biometric learning framework (and its individual components) with recent and previous state-of-the-art models on two challenging benchmark datasets: DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF) and RegDB person re-identification dataset, and, in some cases, achieve performance superior to completely supervised methods.

Paper Structure

This paper contains 17 sections, 7 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our proposed unsupervised cross-spectrum framework, which learns spectral invariance via cross-spectral attention without labeled data, applied to ReID (left) and FaceVeri (right).
  • Figure 2: Our framework generates IR-specific, RGB-specific, and common feature representations from a shared truncated VGG backbone encoder and is optimized using our PTL voting scheme and pixel-channel sparsity term. IR-specific and RGB-specific representations are first used to compute the cross-spectral attention. Then, the RGB and IR (i.e., gallery and probe) common representations are each multiplied by the cross-spectral attention to emphasize mutually beneficial characteristics. The corresponding dimensions are denoted in blue, where $H$ is the height, $W$ is the width, $C$ is the number of input channels, and $\#patches$ is the number of patches generated.
  • Figure 3: First, RGB and IR clusters are generated separately using agglomerative clustering. Next, an RGB cluster with a high cluster quality score is randomly sampled. Each sample of the selected RGB cluster votes for the nearest (in similarity) IR cluster. Lastly, pseudo anchor and positive samples are mined from the newly associated RGB and IR clusters, and negative samples are mined from the nearest RGB samples.
  • Figure 4: The top-5 and top-2 retrievals on the person and face biometric tasks. Green denotes correct retrieval while red denotes incorrect retrieval.
  • Figure 5: Training performance trends (a), cluster formation analysis (b), and hyper-parameter analysis (c, d).
  • ...and 2 more figures
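
The voting step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes the clusters have already been formed, measures "nearest" as Euclidean distance to each IR cluster centroid (the caption does not specify the similarity measure), and omits the cluster quality score and triplet mining. The names `centroid` and `vote_for_ir_cluster` and the toy feature vectors are hypothetical.

```python
from collections import Counter
import math


def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def vote_for_ir_cluster(rgb_cluster, ir_clusters):
    """Let each RGB sample vote for its nearest IR cluster.

    Assumption: nearness is Euclidean distance to the IR cluster centroid.
    Returns the IR cluster id with the most votes and the vote counts.
    """
    centroids = {cid: centroid(pts) for cid, pts in ir_clusters.items()}
    votes = Counter()
    for x in rgb_cluster:
        nearest = min(centroids, key=lambda cid: math.dist(x, centroids[cid]))
        votes[nearest] += 1
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)


# Toy 2-D features: one selected RGB cluster and two candidate IR clusters.
rgb_cluster = [[0.9, 1.0], [1.1, 0.8], [4.8, 5.1]]
ir_clusters = {
    "ir_a": [[1.0, 1.0], [1.2, 0.9]],
    "ir_b": [[5.0, 5.0], [5.2, 4.9]],
}
winner, votes = vote_for_ir_cluster(rgb_cluster, ir_clusters)
print(winner, votes)
```

In this toy case two of the three RGB samples lie closest to `ir_a`, so the RGB cluster is associated with `ir_a` even though one sample votes otherwise. Majority voting of this kind tolerates a few noisy samples inside a cluster, which is presumably why per-sample votes are used rather than a single cluster-to-cluster comparison.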