IFViT: Interpretable Fixed-Length Representation for Fingerprint Matching via Vision Transformer

Yuhang Qiu, Honghui Chen, Xingbo Dong, Zheng Lin, Iman Yi Liao, Massimo Tistarelli, Zhe Jin

TL;DR

This work addresses the need for interpretable fingerprint matching by introducing IFViT, a two-stage framework that learns dense pixel-wise correspondences for alignment and a fixed-length representation for matching. A ViT-based dense registration module produces the pixel-level correspondences; a second ViT-based Siamese module fuses region-of-interest (ROI) and global features into a discriminative representation, with losses that enforce both correspondence quality and embedding separability. The approach achieves state-of-the-art or near-state-of-the-art performance across multiple public datasets while offering fine-grained interpretability through visualizable correspondences and multi-level feature points, including minutiae and pores. In practice, IFViT improves robustness to low-quality and cross-sensor fingerprints and delivers interpretable decisions at a runtime of around 463 ms per pair, which suggests potential for real-world biometric systems and secure matching workflows.

Abstract

Determining the dense feature points on fingerprints that are used to construct deep fixed-length representations for accurate matching, particularly at the pixel level, is of significant interest. To explore the interpretability of fingerprint matching, we propose a multi-stage interpretable fingerprint matching network, namely Interpretable Fixed-length Representation for Fingerprint Matching via Vision Transformer (IFViT), which consists of two primary modules. The first module, an interpretable dense registration module, establishes a Vision Transformer (ViT)-based Siamese Network to capture long-range dependencies and the global context in fingerprint pairs. It provides interpretable dense pixel-wise correspondences of feature points for fingerprint alignment and enhances interpretability in the subsequent matching stage. The second module takes into account both local and global representations of the aligned fingerprint pair to achieve interpretable fixed-length representation extraction and matching. It employs the ViTs trained in the first module with an additional fully connected layer and retrains them to simultaneously produce the discriminative fixed-length representation and interpretable dense pixel-wise correspondences of feature points. Extensive experimental results on diverse publicly available fingerprint databases demonstrate that the proposed framework not only exhibits superior performance on dense registration and matching but also significantly improves interpretability in deep fixed-length representation-based fingerprint matching.
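To make the alignment step more concrete, the sketch below shows one way dense correspondences with confidence scores could be turned into an alignment. It is not the authors' implementation: the function names, the confidence threshold of 0.5, and the choice of a rigid (rotation plus translation) model fitted with the Kabsch least-squares method are all assumptions made for illustration.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that dst ~ src @ R.T + t.

    Uses the Kabsch method (an assumption for this sketch; the abstract does
    not say how the alignment is computed from the correspondences).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 2x2 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def align_from_correspondences(pts_a, pts_b, conf, threshold=0.5):
    """Keep correspondences whose confidence is >= threshold, then fit R, t.

    pts_a, pts_b: (N, 2) arrays of matched pixel coordinates (hypothetical
    output format of a dense registration module).
    conf: (N,) array of confidence scores in [0, 1].
    """
    keep = conf >= threshold
    if keep.sum() < 2:
        raise ValueError("need at least two confident correspondences")
    return estimate_rigid_transform(pts_a[keep], pts_b[keep])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts_a = rng.uniform(0, 256, size=(50, 2))
    theta = np.deg2rad(12.0)
    R_true = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta), np.cos(theta)]])
    t_true = np.array([5.0, -3.0])
    pts_b = pts_a @ R_true.T + t_true
    # Simulate ten wrong matches that carry low confidence.
    conf = np.ones(50)
    conf[:10] = 0.1
    pts_b[:10] += 40.0
    R, t = align_from_correspondences(pts_a, pts_b, conf)
    print(np.round(R, 4), np.round(t, 4))
```

Raising the threshold keeps fewer but more reliable correspondences, which is the trade-off the confidence score exposes.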

Paper Structure

This paper contains 17 sections, 12 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Dense pixel-wise correspondences of feature points produced by IFViT for (a) low-quality and (b) cross-sensor fingerprint pairs, selected at different confidence-score thresholds.
  • Figure 2: Overview of the IFViT architecture. The input fingerprint pair is processed by the ViT to obtain interpretable dense pixel-wise correspondences of feature points, which are in turn used to align the fingerprint pair. The aligned pair is then enhanced by FingerNet and passed into the ViT, which takes both local and global representations into account to produce the discriminative fixed-length representation and the interpretable dense pixel-wise correspondences of feature points in the matching result.
  • Figure 3: Examples of synthetically corrupted fingerprints simulated with different types of noise: (a) original fingerprint; (b) fingerprint with sensor noise; (c) fingerprint after the over-pressurization operation; (d) fingerprint after the dryness operation.
  • Figure 4: Cases prone to matching failure: (a) similar global ridge-flow structures in fingerprints from different identities; (b) dissimilar representations in a local patch of different impressions of the same finger.
  • Figure 5: The procedure for extracting the overlapping fingerprint areas from a given fingerprint pair.
  • ...and 5 more figures
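
The failure cases in Figure 4 suggest why a matcher may want both global and local evidence in a single fixed-length vector. The sketch below illustrates the general idea only. The fusion by concatenating L2-normalized embeddings, the cosine-similarity score, the decision threshold, and all function names are assumptions for illustration, not the fusion or scoring actually used by IFViT.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps avoids division by zero)."""
    return v / max(float(np.linalg.norm(v)), eps)


def fuse(global_emb, local_emb):
    """Concatenate normalized global and local embeddings into one unit vector.

    Normalizing each part first gives the two parts equal weight in the
    fused representation (an assumed design choice for this sketch).
    """
    fused = np.concatenate([l2_normalize(global_emb), l2_normalize(local_emb)])
    return l2_normalize(fused)


def match_score(rep_a, rep_b):
    """Cosine similarity of two unit-length fixed-length representations."""
    return float(np.dot(rep_a, rep_b))


def is_match(rep_a, rep_b, threshold=0.5):
    """Accept the pair when the score reaches a (hypothetical) threshold."""
    return match_score(rep_a, rep_b) >= threshold


if __name__ == "__main__":
    g = np.array([1.0, 0.0, 0.0])
    l_same = np.array([0.0, 2.0])
    l_opposite = np.array([0.0, -2.0])
    # Same global structure, same local patch: maximal score.
    print(match_score(fuse(g, l_same), fuse(g, l_same)))
    # Same global structure, conflicting local patch: score drops.
    print(match_score(fuse(g, l_same), fuse(g, l_opposite)))
```

With equal weighting, agreement on global structure alone is not enough to accept a pair when the local evidence disagrees, which is the situation in Figure 4(a).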