Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations

Riccardo Fogliato, Pratik Patil, Pietro Perona

TL;DR

This study targets confidence intervals for error rates in 1:1 matching tasks, such as FRR and FAR, where errors are rare and test data exhibit dependence. It analyzes parametric (Gaussian/Wilson-type) and nonparametric bootstrap methods, derives asymptotic normality for the scaled error-rate estimators under a balanced identity regime, and provides consistent variance estimators. Empirical results on synthetic data and the MORPH dataset show that Wilson intervals with dependence-adjusted variance reliably achieve nominal coverage, while some bootstrap variants can under- or over-cover depending on the metric and sample size. The paper culminates in practical recommendations and an open-source library to implement CI constructions for 1:1 matching tasks, informing rigorous uncertainty quantification in verification systems.

Abstract

Matching algorithms are commonly used to predict matches between items in a collection. For example, in 1:1 face verification, a matching algorithm predicts whether two face images depict the same person. Accurately assessing the uncertainty of the error rates of such algorithms can be challenging when data are dependent and error rates are low, two aspects that have often been overlooked in the literature. In this work, we review methods for constructing confidence intervals for error rates in 1:1 matching tasks. We derive and examine the statistical properties of these methods, demonstrating through both analysis and experiments with synthetic and real-world datasets how coverage and interval width vary with sample size, error rates, and degree of data dependence. Based on our findings, we provide recommendations for best practices for constructing confidence intervals for error rates in 1:1 matching tasks.
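
The dependence-adjusted Wilson construction mentioned in the TL;DR can be illustrated with a short sketch. This is not the paper's library. It assumes the adjustment takes the form of an effective sample size, n_eff = p(1 - p) / Var, where Var is a dependence-aware variance estimate supplied by the caller. The function names `wilson_interval` and `effective_sample_size`, and the variance value in the usage lines, are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(p_hat, n_eff, alpha=0.05):
    """Wilson score interval for a proportion, given a (possibly effective) sample size."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / n_eff
    center = (p_hat + z**2 / (2 * n_eff)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_eff + z**2 / (4 * n_eff**2)
    )
    return max(0.0, center - half), min(1.0, center + half)


def effective_sample_size(p_hat, var_hat, n_naive):
    """Assumed adjustment: the sample size at which an i.i.d. binomial
    proportion would have variance var_hat. Falls back to the naive
    count when the ratio is undefined."""
    if var_hat <= 0 or p_hat <= 0 or p_hat >= 1:
        return float(n_naive)
    return p_hat * (1 - p_hat) / var_hat


# Illustrative numbers only: 30625 pairwise comparisons, and a made-up
# dependence-aware variance estimate of 4e-6 for the error rate.
p_hat = 0.01
naive = wilson_interval(p_hat, 30625)
adjusted = wilson_interval(p_hat, effective_sample_size(p_hat, 4e-6, 30625))
print("naive:", naive)
print("adjusted:", adjusted)
```

Under this assumption, a variance estimate larger than the binomial variance yields a smaller effective sample size and therefore a wider interval. The interval also keeps a positive upper limit when no errors are observed, which matters when error rates are low.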

Paper Structure

This paper contains 48 sections, 7 theorems, 38 equations, 9 figures, 1 table, and 1 algorithm.

Key Result

Proposition 1

Assume that $\lim_{G\rightarrow\infty}{\mathop{\mathrm{Var}}}(\sqrt{G}\widehat{\mathtt{FRR}})=c_{\mathtt{FRR}}$ and $\lim_{G\rightarrow\infty}{\mathop{\mathrm{Var}}}(\sqrt{G}\widehat{\mathtt{FAR}})=c_{\mathtt{FAR}}$ for some positive constants $c_\mathtt{FRR}, c_\mathtt{FAR}$. Then, as $G \to \infty

Figures (9)

  • Figure 1: Different methods for constructing confidence intervals can lead to different conclusions due to miscoverage. Six methods for computing estimates and corresponding 95% confidence intervals on synthetic data for the false accept rate ($\mathtt{FAR}$) of two 1:1 matching algorithms (A and B) that have equal underlying accuracy ($\mathtt{FAR}=10^{-1}$). The data contain 50 groups with 5 images each, and all pairwise comparisons are considered in the estimation of the error metric (details in the experiments section). Dots and bars correspond to error estimates and their confidence intervals. The naive Wilson, subsets bootstrap, and two-level bootstrap intervals may lead the practitioner to erroneously conclude that Algorithm A performs worse than Algorithm B, although in the simulation they are equivalent. In our analysis and experiments we find that only Wilson intervals achieve nominal coverage in the presence of both low error rates and sample dependence. Double-or-nothing and vertex bootstrap intervals also work well in settings characterized only by sample dependence.
  • Figure 2: Estimated interval coverage of 95% confidence intervals for $\mathtt{FRR}$ and $\mathtt{FAR}$ on synthetic data in settings characterized by low error rates and sample dependence. The data contain 50 identities and 5 instances (e.g., images) per identity. Lines and shaded regions indicate, for each method, the estimated coverage and its 95% confidence interval across data replications. The dashed line indicates nominal coverage (95%). The experimental setup is described in the experiments section. Only Wilson's method (blue lines) guarantees accurate coverage across all experimental conditions.
  • Figure 3: Estimated interval coverage versus nominal coverage for $\mathtt{FRR}$ and $\mathtt{FAR}$ on synthetic data. Data contain $G=50$ identities with $M=5$ instances each. Colored lines and shaded bands indicate estimated coverage computed on $10^3$ independent data replications and corresponding 95% naive Wilson intervals for the coverage respectively. Ideally, estimated coverage would coincide with nominal coverage (black dashed line).
  • Figure 4: Estimated coverage of 95% confidence intervals for $\mathtt{FAR}$ versus sample size on synthetic data. Data contain $G$ identities (horizontal axis) with $M=5$ instances each.
  • Figure 5: Estimated interval coverage versus nominal coverage for $\mathtt{FRR}$ and $\mathtt{FAR}$ on MORPH. Samples were generated by resampling $G=50$ identities from the original dataset without replacement.
  • ...and 4 more figures
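
The synthetic setup in the captions above (all pairwise comparisons among $G=50$ identities with $M=5$ instances each) can be made concrete with a short sketch of how $\mathtt{FRR}$ and $\mathtt{FAR}$ are computed from a pairwise score matrix. The function name `pairwise_error_rates`, the accept-if-score-at-least-threshold rule, and the random placeholder scores are assumptions for illustration, not the paper's data-generating process.

```python
import numpy as np


def pairwise_error_rates(scores, ids, threshold):
    """FRR and FAR over all unordered pairs of instances.

    scores: (n, n) array of similarity scores (upper triangle is used).
    ids:    length-n identity label per instance.
    A pair is accepted as a match when its score is >= threshold.
    """
    scores = np.asarray(scores)
    ids = np.asarray(ids)
    iu, ju = np.triu_indices(len(ids), k=1)   # each unordered pair once
    same = ids[iu] == ids[ju]                 # genuine (same-identity) pairs
    accept = scores[iu, ju] >= threshold
    n_gen = int(same.sum())
    n_imp = int((~same).sum())
    frr = float((same & ~accept).sum() / n_gen)   # genuine pairs rejected
    far = float((~same & accept).sum() / n_imp)   # impostor pairs accepted
    return frr, far, n_gen, n_imp


# Placeholder data: random symmetric scores, used only to show the pair counts.
rng = np.random.default_rng(0)
G, M = 50, 5
ids = np.repeat(np.arange(G), M)
scores = rng.random((G * M, G * M))
scores = (scores + scores.T) / 2
frr, far, n_gen, n_imp = pairwise_error_rates(scores, ids, threshold=0.5)
print(n_gen, n_imp)
```

With 50 identities and 5 images each there are 500 same-identity pairs and 30,625 different-identity pairs, all built from only 250 images. Every image appears in many comparisons, which is the source of the sample dependence that the naive intervals ignore.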

Theorems & Definitions (7)

  • Proposition 1: Normality of scaled error rates
  • Proposition 2: Consistency of plug-in variance estimators
  • Proposition 3: Equivalence of plug-in and jackknife variance estimators
  • Proposition 4: Bias of subsets bootstrap estimators
  • Proposition 5: Bias of vertex bootstrap estimators
  • Proposition 6: Bias of double-or-nothing bootstrap estimators
  • Proposition 7: Consistency of bootstrap estimators for $\mathtt{FRR}$
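
As an illustration of the resampling schemes named in the propositions above, the following sketch implements one common formulation of the double-or-nothing bootstrap for $\mathtt{FAR}$. It assumes each identity independently receives weight 0 or 2 with equal probability, that a pair of identities is weighted by the product of their weights, and that a percentile interval is taken over the replicates. The paper's exact estimator may differ. The function name `double_or_nothing_far` and the identity-level count matrices `err` and `cnt` are hypothetical.

```python
import numpy as np


def double_or_nothing_far(err, cnt, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for FAR via double-or-nothing identity reweighting.

    err[g, h]: number of false accepts between identities g and h (g < h).
    cnt[g, h]: number of comparisons between identities g and h (g < h).
    """
    rng = np.random.default_rng(seed)
    G = err.shape[0]
    iu, ju = np.triu_indices(G, k=1)          # unordered identity pairs
    e, c = err[iu, ju], cnt[iu, ju]
    reps = []
    for _ in range(n_boot):
        w = 2.0 * rng.integers(0, 2, size=G)  # weight 0 or 2 per identity
        pw = w[iu] * w[ju]                    # product weight per identity pair
        denom = (pw * c).sum()
        if denom > 0:                         # skip replicates with no pairs left
            reps.append((pw * e).sum() / denom)
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Placeholder counts: 50 identities, 25 comparisons per identity pair,
# false accepts drawn at an illustrative rate of 1%.
rng = np.random.default_rng(1)
G = 50
cnt = np.full((G, G), 25)
err = rng.binomial(25, 0.01, size=(G, G))
print(double_or_nothing_far(err, cnt, n_boot=500))
```

In this sketch, when no errors are observed every replicate equals zero and the percentile interval collapses to a single point, whereas a Wilson interval keeps a positive upper limit in the same situation.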