Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition

Jean-Rémy Conti, Stéphan Clémençon

TL;DR

This work addresses uncertainty quantification for ROC-based evaluation of similarity scoring in Face Recognition, where both accuracy and fairness are critical. It develops a dedicated recentered bootstrap to construct confidence bands for the ROC curve and associated fairness metrics, leveraging the generalized $U$-statistic structure of $\mathrm{FAR}$ and $\mathrm{FRR}$. The authors prove asymptotic validity of the recentered bootstrap and introduce a scalar uncertainty measure $U[\mathrm{ROC}]$ to compare robustness across metrics. Empirical results on MORPH and RFW datasets show that naive bootstrap underestimates the ROC and that the recentered approach achieves nominal coverage, guiding more trustworthy fairness comparisons. Collectively, the method provides practical, statistically sound tools for decision-making in FR systems under uncertainty and regulatory scrutiny.

Abstract

The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics.
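The following Python sketch illustrates, on synthetic data, why the U-statistic form matters for the bootstrap. It is an illustration under stated assumptions, not the authors' implementation: the function names (`frr_u`, `frr_v`, `bootstrap_frr`), the toy embeddings, the fixed threshold `t`, the within-identity resampling scheme, and the percentile-style intervals are all choices made here. A bootstrap resample contains duplicated images, and a pair made of two copies of the same image scores 1 and is always accepted, so the bootstrap replicates of the FRR concentrate around the V-statistic (all ordered pairs, including an image paired with itself) rather than around the empirical U-statistic. Recentering shifts the replicates by the gap between the two.

```python
import numpy as np


def frr_u(sim, ids, t):
    """U-statistic FRR: share of genuine pairs (i < j, same identity) scoring below t."""
    iu = np.triu_indices(len(ids), k=1)
    genuine = (ids[:, None] == ids[None, :])[iu]
    return float(np.mean(sim[iu][genuine] < t))


def frr_v(sim, ids, t):
    """V-statistic FRR: same count over all ordered same-identity pairs, i == j included."""
    same = ids[:, None] == ids[None, :]
    return float(np.mean(sim[same] < t))


def bootstrap_frr(sim, ids, t, n_boot, rng):
    """Resample images with replacement within each identity, then recompute frr_u."""
    groups = [np.flatnonzero(ids == k) for k in np.unique(ids)]
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([rng.choice(g, size=g.size, replace=True) for g in groups])
        reps[b] = frr_u(sim[np.ix_(idx, idx)], ids[idx], t)
    return reps


# Toy balanced dataset (assumption): K identities, m images each, unit-norm embeddings.
rng = np.random.default_rng(0)
K, m, d = 20, 5, 16
centers = rng.normal(size=(K, d))
X = np.repeat(centers, m, axis=0) + 0.8 * rng.normal(size=(K * m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
ids = np.repeat(np.arange(K), m)
sim = X @ X.T  # cosine similarity matrix
t = 0.6        # hypothetical acceptance threshold

u = frr_u(sim, ids, t)
v = frr_v(sim, ids, t)
reps = bootstrap_frr(sim, ids, t, n_boot=500, rng=rng)

alpha = 0.05
q = [alpha / 2, 1 - alpha / 2]
naive_ci = tuple(np.quantile(reps, q))                 # centered near v, not u
recentered_ci = tuple(np.quantile(reps + (u - v), q))  # shifted back toward u

print(f"FRR (U-statistic): {u:.3f}   FRR (V-statistic): {v:.3f}")
print(f"mean of bootstrap replicates: {reps.mean():.3f}")
print(f"naive interval:      ({naive_ci[0]:.3f}, {naive_ci[1]:.3f})")
print(f"recentered interval: ({recentered_ci[0]:.3f}, {recentered_ci[1]:.3f})")
```

With `m` images per identity, the V-statistic equals the U-statistic scaled by `(m - 1) / m` in this toy setup, so the gap is largest when identities have few images. The same recentering idea would apply pointwise to the FAR and to the whole ROC curve; how the paper builds its bands from the recentered replicates may differ from the simple shift used here.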

Paper Structure (37 sections, 5 theorems, 52 equations, 22 figures, 2 tables, 3 algorithms)

Key Result

Proposition 1

(Strong consistency) With probability one, we have:

Figures (22)

  • Figure 1: Empirical $\mathrm{ROC}$ curves for three different models (ArcFace, CosFace, AdaCos) and for two distinct evaluation datasets (see the real-data experiments subsection). The $\mathrm{ROC}$ curves for the first dataset are depicted with solid lines while the $\mathrm{ROC}$ curves for the second dataset are displayed with dashed lines. A confidence band for the $\mathrm{ROC}$ computed with the ArcFace model on the first dataset is displayed in light blue.
  • Figure 2: Bootstrap versions of the ROC curve ($\widehat{\mathrm{ROC}}_n^*$ in light blue) and the empirical ROC curve ($\widehat{\mathrm{ROC}}_n$ in dark blue). The V-statistic counterpart $\widetilde{\mathrm{ROC}}_n$ is depicted in red.
  • Figure 3: Confidence bands at $95$% confidence level for the empirical $\mathrm{ROC}$ curve (dark blue), using two methods: the naive bootstrap (light red) and the recentered bootstrap (light blue).
  • Figure 4: Confidence bands at $95$% confidence level for the $\mathrm{FRR}_{\mathrm{min}}^\mathrm{max}$ fairness metric, for two models (ArcFace, AdaCos). The empirical fairness metrics are depicted as solid lines.
  • Figure 5: Normalized uncertainty of several fairness metrics ($\mathrm{FAR}$ fairness in solid lines, $\mathrm{FRR}$ fairness in dashed lines). The gender label is chosen as the sensitive attribute.
  • ...and 17 more figures

Theorems & Definitions (10)

  • Remark 1
  • Proposition 1
  • Theorem 1
  • Remark 2
  • Definition 1
  • Theorem 2
  • Theorem 3
  • proof
  • Corollary 1
  • proof