Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition
Jean-Rémy Conti, Stéphan Clémençon
TL;DR
This work addresses uncertainty quantification for ROC-based evaluation of similarity scoring in Face Recognition, where both accuracy and fairness are critical. It develops a dedicated recentered bootstrap to construct confidence bands for the ROC curve and associated fairness metrics, leveraging the generalized $U$-statistic structure of $FAR$ and $FRR$. The authors prove asymptotic validity of the recentered bootstrap and introduce a scalar uncertainty measure $U[\text{ROC}]$ to compare robustness across metrics. Empirical results on the MORPH and RFW datasets show that the naive bootstrap underestimates the ROC curve, whereas the recentered approach achieves nominal coverage, which supports more trustworthy fairness comparisons. Collectively, the method provides practical, statistically sound tools for decision-making in FR systems under uncertainty and regulatory scrutiny.
Abstract
The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics.
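To make the U-statistic point concrete, the following minimal sketch computes the false rejection rate at a fixed threshold as an average over pairs of distinct images, then compares a naive bootstrap with a recentered one. It is an illustration under stated assumptions, not the authors' implementation: the function names `frr_far` and `bootstrap_frr`, the synthetic embeddings, the threshold, and the choice to recenter around the V-statistic version of the estimate (the one that also counts each image paired with itself) are all assumptions made here. The toy data uses the same number of images per identity, and images are resampled with replacement within each identity.

```python
import numpy as np

def frr_far(sim, ids, t, self_pairs=False):
    """FAR and FRR at threshold t as averages over image pairs.

    self_pairs=False: U-statistic form (pairs of distinct images only).
    self_pairs=True: V-statistic form (each image is also paired with itself).
    """
    same = ids[:, None] == ids[None, :]
    genuine = same.copy()
    if not self_pairs:
        np.fill_diagonal(genuine, False)
    frr = np.mean(sim[genuine] <= t)  # genuine pairs wrongly rejected
    far = np.mean(sim[~same] > t)     # impostor pairs wrongly accepted
    return far, frr

def bootstrap_frr(sim, ids, t, n_boot=200, seed=0):
    """Return (naive, recentered) bootstrap replicates of FRR(t)."""
    rng = np.random.default_rng(seed)
    _, frr_u = frr_far(sim, ids, t)                   # empirical estimate
    _, frr_v = frr_far(sim, ids, t, self_pairs=True)  # assumed recentering target
    groups = [np.flatnonzero(ids == k) for k in np.unique(ids)]
    naive = np.empty(n_boot)
    for b in range(n_boot):
        # Resample images with replacement inside each identity. A duplicated
        # image forms a "genuine pair" with itself (similarity 1), which is
        # what pulls the naive replicates away from the empirical estimate.
        idx = np.concatenate([rng.choice(g, size=g.size) for g in groups])
        _, naive[b] = frr_far(sim[np.ix_(idx, idx)], ids[idx], t)
    # Measure bootstrap fluctuations around the V-statistic version, then
    # re-attach them to the empirical (U-statistic) estimate.
    recentered = frr_u + (naive - frr_v)
    return naive, recentered

# Synthetic stand-in for face embeddings: 20 identities, 5 images each.
rng = np.random.default_rng(1)
n_ids, per_id, dim = 20, 5, 16
ids = np.repeat(np.arange(n_ids), per_id)
centers = rng.normal(size=(n_ids, dim))
emb = centers[ids] + rng.normal(size=(ids.size, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T  # cosine similarity matrix

t = 0.5
far, frr = frr_far(sim, ids, t)
naive, recentered = bootstrap_frr(sim, ids, t)
print(f"empirical FRR: {frr:.3f}")
print(f"naive bootstrap mean: {naive.mean():.3f}")
print(f"recentered bootstrap mean: {recentered.mean():.3f}")
```

In this toy setting the naive replicates are centered below the empirical FRR, because self-pairs are never rejected, while the recentered replicates are centered close to it. The same construction would be repeated over a grid of thresholds to obtain bands for a full ROC curve.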
