Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition

Jean-Rémy Conti; Stéphan Clémençon

TL;DR

This work addresses uncertainty quantification for ROC-based evaluation of similarity scoring in Face Recognition (FR), where both accuracy and fairness are critical. It develops a dedicated recentered bootstrap to construct confidence bands for the ROC curve and associated fairness metrics, leveraging the generalized $U$-statistic structure of $\mathrm{FAR}$ and $\mathrm{FRR}$. The authors prove asymptotic validity of the recentered bootstrap and introduce a scalar uncertainty measure $U[\mathrm{ROC}]$ to compare robustness across metrics. Empirical results on the MORPH and RFW datasets show that the naive bootstrap underestimates the ROC and that the recentered approach achieves nominal coverage, guiding more trustworthy fairness comparisons. Collectively, the method provides practical, statistically sound tools for decision-making in FR systems under uncertainty and regulatory scrutiny.

Abstract

The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics.
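The recentering idea mentioned in the abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' algorithm: it treats a single sample and a plain degree-2 U-statistic (for instance, the false rejection rate of one identity at a fixed threshold `t`), whereas the paper works with generalized U-statistics over many identities and with whole ROC curves. The function names, the cosine-similarity kernel, and the basic (pivotal) interval are all hypothetical choices made for illustration. The key point is that a bootstrap sample drawn with replacement can pair an image with itself, so the bootstrap U-statistic is centered on the V-statistic rather than on the original U-statistic; deviations are therefore measured from the V-statistic.

```python
import numpy as np

def false_rejection_kernel(emb, t):
    """Hypothetical kernel: K[i, j] = 1 if cosine similarity of i and j is <= t."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return (z @ z.T <= t).astype(float)

def u_stat(K):
    """Degree-2 U-statistic: average of K[i, j] over pairs with i != j."""
    n = K.shape[0]
    return (K.sum() - np.trace(K)) / (n * (n - 1))

def v_stat(K):
    """V-statistic counterpart: average over all pairs, diagonal included."""
    return K.mean()

def bootstrap_u_stats(K, n_boot, rng):
    """U-statistics recomputed on index samples drawn with replacement."""
    n = K.shape[0]
    out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # Repeated indices bring diagonal entries of K into off-diagonal
        # positions, which is why the bootstrap mean is v_stat(K), not u_stat(K).
        out[b] = u_stat(K[np.ix_(idx, idx)])
    return out

def recentered_interval(K, n_boot=1000, alpha=0.05, seed=0):
    """Basic bootstrap interval with deviations taken from the V-statistic.

    Assumption: the law of (bootstrap U - V) approximates the law of
    (U - target); this mirrors the recentering principle only in spirit.
    """
    rng = np.random.default_rng(seed)
    dev = bootstrap_u_stats(K, n_boot, rng) - v_stat(K)
    q_lo, q_hi = np.quantile(dev, [alpha / 2, 1 - alpha / 2])
    u = u_stat(K)
    return float(u - q_hi), float(u - q_lo)
```

With this kernel the diagonal of `K` is zero (an image compared with itself has similarity 1), so `v_stat(K)` equals `u_stat(K) * (n - 1) / n`. A naive interval built from the raw bootstrap quantiles would therefore sit slightly below the empirical rate, which is the kind of shift the recentering removes.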

Paper Structure (37 sections, 5 theorems, 52 equations, 22 figures, 2 tables, 3 algorithms)

Introduction
Background and Preliminaries
Similarity Scoring - Probabilistic and Statistical Framework
ROC Analysis - Evaluation of Performance/Fairness in Similarity Scoring
Similarity Scoring Metrics - Assessing Uncertainty
Statistical Inference - Consistency Result
Bootstrapping the Performance/Fairness Metrics - Confidence Regions
Numerical Experiments - Applications
Conclusion
Further Remarks
Fairness Metrics
Max-min ratio.
Max-geomean ratio.
Log-geomean sum.
Gini coefficient.
...and 22 more sections

Key Result

Proposition 1

(Strong consistency) With probability one, we have:

Figures (22)

Figure 1: Empirical $\mathrm{ROC}$ curves for three different models (ArcFace, CosFace, AdaCos) and for two distinct evaluation datasets (see \ref{subsec:real_experiments}). The $\mathrm{ROC}$ curves for the first dataset are depicted with solid lines while the $\mathrm{ROC}$ curves for the second dataset are displayed with dashed lines. A confidence band for the $\mathrm{ROC}$ computed with the ArcFace model on the first dataset is displayed in light blue.
Figure 2: Bootstrap versions of the ROC curve ($\widehat{\mathrm{ROC}}_n^*$ in light blue) and the empirical ROC curve ($\widehat{\mathrm{ROC}}_n$ in dark blue). The V-statistic counterpart $\widetilde{\mathrm{ROC}}_n$ is depicted in red.
Figure 3: Confidence bands at the $95\%$ confidence level for the empirical $\mathrm{ROC}$ curve (dark blue), using two methods: the naive bootstrap (light red) and the recentered bootstrap (light blue).
Figure 4: Confidence bands at the $95\%$ confidence level for the $\mathrm{FRR}_{\mathrm{min}}^{\mathrm{max}}$ fairness metric, for two models (ArcFace, AdaCos). The empirical fairness metrics are depicted as solid lines.
Figure 5: Normalized uncertainty of several fairness metrics ($\mathrm{FAR}$ fairness in solid lines, $\mathrm{FRR}$ fairness in dashed lines). The gender label is chosen as the sensitive attribute.
...and 17 more figures

Theorems & Definitions (10)

Remark 1
Proposition 1
Theorem 1
Remark 2
Definition 1
Theorem 2
Theorem 3
proof
Corollary 1
proof
