TransFIRA: Transfer Learning for Face Image Recognizability Assessment

Allen Tu, Kartik Narayan, Joshua Gleason, Jennifer Xu, Matthew Meyn, Tom Goldstein, Vishal M. Patel

TL;DR

TransFIRA reframes face image quality assessment as recognizability predicted from the deployed encoder’s embedding space rather than visual proxies. By deriving recognizability labels from class-center similarities (CCS) and angular separation (CCAS) and training a lightweight predictor head, it yields encoder-specific, geometry-aligned scores. Recognizability-informed aggregation uses a natural CCAS>0 cutoff for filtering and CCS-based weighting to improve template verification, achieving state-of-the-art results on BRIAR and IJB-C with strong cross-dataset transfer and encoder-grounded explainability. The framework extends to body recognition via sigmoid calibration, demonstrating robust, modality-agnostic recognizability modeling that enhances accuracy and interpretability while remaining annotation-free.
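The label derivation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation. It assumes that a class center is the renormalized mean of an identity's L2-normalized embeddings, that NNCCS is the similarity to the nearest other-class center, and that CCAS is the difference CCS − NNCCS (which is consistent with the CCAS>0 cutoff, though the summary does not spell it out). The function name `recognizability_labels` is hypothetical.

```python
import numpy as np

def recognizability_labels(embeddings, labels):
    """Return per-sample (ccs, nnccs, ccas) under the assumptions stated above."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)        # unit-length embeddings
    classes, inverse = np.unique(np.asarray(labels).ravel(), return_inverse=True)

    # Class centers: sum the normalized embeddings per identity, then renormalize
    # (same direction as the per-class mean).
    centers = np.zeros((len(classes), x.shape[1]))
    np.add.at(centers, inverse, x)
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)

    sims = x @ centers.T                                     # cosine similarity to every center
    rows = np.arange(len(x))
    ccs = sims[rows, inverse]                                # similarity to own class center

    others = sims.copy()
    others[rows, inverse] = -np.inf                          # mask out the own-class column
    nnccs = others.max(axis=1)                               # nearest impostor center

    ccas = ccs - nnccs                                       # assumed definition of CCAS
    return ccs, nnccs, ccas

# Toy data: the last sample is labeled identity 0 but lies near identity 1.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
ids = [0, 0, 1, 1, 0]
ccs, nnccs, ccas = recognizability_labels(emb, ids)
print(ccas > 0)
```

Under these assumptions, a sample with CCAS>0 is closer to its own class center than to any impostor center, which is why zero serves as a natural filtering threshold.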

Abstract

Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder's decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary-aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment (encoder-specific, accurate, interpretable, and extensible across modalities), significantly advancing FIQA in accuracy, explainability, and scope.
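The recognizability-informed aggregation in (ii) might look roughly like the sketch below: drop images whose predicted CCAS is not positive, then form a CCS-weighted average of the remaining embeddings. The function name `aggregate_template`, the fallback when every image is filtered out, and the clipping of weights to small positive values are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def aggregate_template(embeddings, ccs_pred, ccas_pred):
    """Filter by predicted CCAS > 0, then CCS-weight the surviving embeddings."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-length embeddings
    ccs_pred = np.asarray(ccs_pred, dtype=float)

    keep = np.asarray(ccas_pred, dtype=float) > 0      # natural cutoff from the text
    if not keep.any():
        # Assumption: if nothing passes the cutoff, fall back to using all images
        # rather than returning an empty template.
        keep = np.ones(len(x), dtype=bool)

    # Assumption: weights must be positive, so clip predicted CCS from below.
    w = np.clip(ccs_pred[keep], 1e-8, None)
    template = (w[:, None] * x[keep]).sum(axis=0)
    return template / np.linalg.norm(template)         # unit-length template

# Toy template: the second image is predicted unrecognizable and is dropped.
t = aggregate_template([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                       ccs_pred=[0.9, 0.8, 0.3],
                       ccas_pred=[0.5, -0.2, 0.1])
print(t)
```

Two templates built this way can then be compared with a dot product, since both are unit length.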

Paper Structure

This paper contains 20 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of TransFIRA. Visual quality does not reliably predict recognizability; for instance, blurred faces may still be discriminative while clear ones may fail. TransFIRA avoids such proxies by deriving recognizability labels directly from class-center similarities ($CCS$, $NNCCS$, $CCAS$) and fine-tuning any pretrained encoder end-to-end with a prediction head. Scores remain tied to the encoder’s embedding space, enabling two key operations: recognizability-weighted aggregation using $CCS$ and principled filtering with the natural cutoff $CCAS>0$. This encoder-agnostic design improves robustness across challenging benchmarks (e.g., BRIAR Protocol 3.1 \cite{cornett2023expanding}) and generalizes naturally to other modalities (Section \ref{sec:method:body}).
  • Figure 2: Overall ROC analysis. Top: template-level ROC comparisons across different IQA methods; for clarity, only the strongest variant of each is shown. Bottom: ablation study illustrating the individual and combined effects of CCAS-based filtering and CCS-based weighting. Metrics correspond to Table \ref{tab:tar_results}, and Average denotes uniform mean aggregation.
  • Figure 3: Image-level ERC curves at a target FMR of $\mathbf{10^{-3}}$. Curves closer to the bottom-left indicate better trade-offs between the fraction of images discarded and FNMR. AUCs and Spearman correlations are reported in Table \ref{tab:erc_auc_results}, with stricter operating points ($10^{-4}$ and $10^{-6}$) reported in Appendix \ref{sec:appendix:erc}. For clarity, only the strongest variant of each method is shown.
  • Figure 4: Sigmoid calibration of BRIAR \cite{cornett2023expanding} recognizability labels for the SemReID \cite{zhu2022semreid} body encoder. Left: raw CCS/NNCCS distributions (mean 0.97). Right: sigmoid-calibrated distributions (mean 0.50). Appendix \ref{sec:appendix:body} reports full metrics, including raw and calibrated CCS/CCAS variants.
  • Figure 5: Explainability via Gaussian blur with ArcFace \cite{guo2021insightface, zhu2021webface260m} on IJB-C \cite{maze2018iarpa}. Across all four examples, adding light blur slightly improves recognizability, as both CCS and CCAS increase relative to the no-blur condition. In contrast, moderate and heavy blur produce a monotonic decline in recognizability, with the sharpest drop under heavy blur. These patterns highlight that visual degradation does not map directly to recognizability: mild blur can reduce misalignment or suppress distractors, enhancing embeddings, while stronger blur consistently erodes discriminability. This illustrates the value of encoder-grounded signals for capturing nuanced, non-monotonic effects of perturbations.
  • ...and 1 more figure
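
The sigmoid calibration summarized in the Figure 4 caption can be illustrated with a short sketch. The exact functional form used in the paper is not given in this summary, so centering on the mean score and scaling by the standard deviation are assumptions, as is the function name `sigmoid_calibrate`. The idea shown is only that body-encoder similarities bunched near 0.97 can be spread around 0.50 by a monotonic squashing function.

```python
import numpy as np

def sigmoid_calibrate(scores, center=None, scale=None):
    """Map raw similarities to (0, 1) with a sigmoid; the ordering of scores is preserved."""
    s = np.asarray(scores, dtype=float)
    # Assumption: center on the mean and scale by the standard deviation when
    # no explicit parameters are supplied.
    center = s.mean() if center is None else center
    scale = s.std() if scale is None else scale
    scale = max(scale, 1e-8)                      # guard against zero spread
    return 1.0 / (1.0 + np.exp(-(s - center) / scale))

# Hypothetical raw CCS values clustered near 0.97, as in the left panel.
raw = np.array([0.95, 0.96, 0.97, 0.98, 0.99])
cal = sigmoid_calibrate(raw)
print(raw.mean(), cal.mean())
```

Because the mapping is monotonic, the ranking of images by recognizability is unchanged; only the spread of the label distribution differs.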