A Rapid Test for Accuracy and Bias of Face Recognition Technology

Manuel Knott, Ignacio Serna, Ethan Mann, Pietro Perona

TL;DR

This work tackles the challenge of accurately and affordably benchmarking 1:1 face verification across cloud FR services by eliminating manual ground-truth labeling. It introduces a rapid, annotation-free pipeline that sources recent web images for two datasets, infers ground-truth identities from ensemble embeddings via spectral factorization, and aggregates results across five cloud services to produce $FNMR$ vs. $FMR$ curves and bias metrics. Key contributions include an automated validation against hand-annotated labels, the first public multi-service benchmark of FR accuracy and demographic bias, and a framework that highlights biases such as poorer performance for Asian women in certain services. The approach promises to democratize FR testing, enabling rapid, scalable, privacy-conscious evaluation that informs developers, policymakers, and the public about accuracy and fairness in real-world deployments.

Abstract

Measuring the accuracy of face recognition (FR) systems is essential for improving performance and ensuring responsible use. Accuracy is typically estimated using large annotated datasets, which are costly and difficult to obtain. We propose a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation, starting from approximate labels (e.g., from web search results). Unlike previous methods for training set label cleaning, ours leverages the embedding representation of the models being evaluated, achieving high accuracy in smaller-sized test datasets. Our approach reliably estimates FR accuracy and ranking, significantly reducing the time and cost of manual labeling. We also introduce the first public benchmark of five FR cloud services, revealing demographic biases, particularly lower accuracy for Asian women. Our rapid test method can democratize FR testing, promoting scrutiny and responsible use of the technology. Our method is provided as a publicly accessible tool at https://github.com/caltechvisionlab/frt-rapid-test

Paper Structure

This paper contains 27 sections, 1 equation, 18 figures, 1 table, 1 algorithm.

Figures (18)

  • Figure 1: Unsupervised accuracy estimates on five FR cloud services match supervised estimates. The plots show the False Non-Match Rate (FNMR) (equivalently, the False Reject Rate) vs. the False Match Rate (FMR) (equivalently, the False Accept Rate) of five commercial cloud services on two collections of face images: Celebrities and Athletes. The thin dark lines indicate accuracy as estimated by our automated method. The thick pale lines indicate the ground-truth estimates through human labeling, which took more than a month of human labor to produce. See also \ref{sec:accuracy-and-bias}.
  • Figure 2: Overview of our method. An operator provides a list of people's names that are used as queries for image URL sourcing from the internet. The images are accessed through their URL and are not stored. Several face recognition services are evaluated simultaneously (five in this study). Each service detects faces in the selected images and assigns a same-identity (or "match") confidence value to pairs of faces. From this information, an estimate of which faces belong to which identity is computed. Using this estimate, FNMR-vs-FMR curves and bias estimates may be produced (\ref{fig:accuracy}, \ref{fig:bias-equal-error}) to estimate the accuracy of each service. Our method does not require hand-annotation and estimates identity labels for each face image from the data. In this study, hand-annotated labels were collected purely to validate our method and were not available to our method.
  • Figure 3: ID Label Estimation Method (\ref{sec:gt_estimation}). (Left column) Matrices showing the confidence values assigned by one of the FRT services to face pairs in queries $q=76, 111$. Each row and each column corresponds to a face image, and each matrix entry indicates the service's confidence that the corresponding pair of face images belongs to the same person (the indices have been rearranged to make the block structure apparent). Yellow indicates high confidence, and blue indicates low confidence. The top matrix has a single block, while the bottom one has two blocks, suggesting that two different identities with a significant number of images are associated with the query. (Second column) The top eigenvectors (singular vectors whose singular value exceeds a threshold) of the matrices, where the x-axis indicates the image index, act as indicator functions of which images are associated with each identity. (Third column) By thresholding the eigenvectors, the algorithm discovers which images belong to which identity. (Right column) In the top row, information from the eigenvector is combined with corresponding eigenvectors from other services by majority vote. The bottom row does not meet the criteria for inclusion since it contains more than one identity and is discarded from further consideration.
  • Figure 4: Sample of service output and evaluation. (Left) Distributions of confidence values for same-ID face pairs (pink) and different-ID pairs (blue) for the Face++ FRT service. (Middle) FMR and FNMR curves as a function of confidence thresholds. (Right) FMR-FNMR curves. Our method's estimate (thin dark lines) is close to the values obtained through hand-annotation (thick pale lines). These curves are incorporated in \ref{fig:accuracy} (left). Plots showing the same statistics separately for all datasets and all services may be found in Figs. \ref{fig:results-all-celeb} and \ref{fig:results-all-athletes}.
  • Figure 5: Measuring bias vis-a-vis gender and race or geographical area. Our method's (Estimate) vs. hand-annotated (Annotation) equal error rate (FMR=FNMR) of the five services computed for each intersectional group defined by gender and race (Celebrities) or geographical area of country (Athletes). Our method correctly detects large biases (i.e., differences in accuracy across demographic groups): see the markedly higher error rates for Asian female celebrities in Amazon Rekognition and Verigram, the two more accurate services. Detailed FMR-FNMR curves per demographic group are depicted in \ref{fig:bias}. Confidence intervals, computed using Wilson's method \cite{fogliato2024confidence}, are about 3x larger than the markers (not shown to preserve visual clarity).
  • ...and 13 more figures
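
The captions of Figures 3 and 4 describe two steps that a short sketch can make concrete: thresholding the top eigenvector of a pairwise confidence matrix to find the images of the dominant identity, and computing FMR and FNMR at a given confidence threshold. The Python below is an illustrative sketch only, not the authors' implementation. The function names, the `rel_threshold` parameter, and the toy matrix are assumptions introduced here; the multi-service majority vote and the rule for discarding queries with more than one identity are omitted.

```python
import numpy as np


def dominant_identity_mask(conf, rel_threshold=0.5):
    """Boolean mask of the images assigned to the dominant identity.

    conf: (n, n) matrix of pairwise match confidences from one service.
    rel_threshold: hypothetical cutoff, relative to the largest
        eigenvector entry (an assumption, not a value from the paper).
    """
    sym = (conf + conf.T) / 2.0            # make the matrix symmetric
    _, eigvecs = np.linalg.eigh(sym)       # eigenvalues in ascending order
    v = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
    if v.sum() < 0:                        # resolve the sign ambiguity
        v = -v
    return v > rel_threshold * v.max()


def fmr_fnmr(conf, labels, threshold):
    """FMR and FNMR over all unordered face pairs at one threshold.

    labels: integer identity label per face (estimated or annotated).
    A pair is declared a match when its confidence is >= threshold.
    """
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)  # each unordered pair once
    scores = conf[iu]
    same = (labels[:, None] == labels[None, :])[iu]
    fmr = float(np.mean(scores[~same] >= threshold))   # different ID, accepted
    fnmr = float(np.mean(scores[same] < threshold))    # same ID, rejected
    return fmr, fnmr


# Toy query: four images of one person plus two unrelated faces.
toy_conf = np.full((6, 6), 0.1)
toy_conf[:4, :4] = 0.9
np.fill_diagonal(toy_conf, 1.0)

mask = dominant_identity_mask(toy_conf)
rates = fmr_fnmr(toy_conf, [0, 0, 0, 0, 1, 2], threshold=0.5)
```

In this toy example the mask selects the first four images, and both error rates are zero at a threshold of 0.5 because the same-identity and different-identity confidences are perfectly separated. Sweeping `threshold` over a range of values and plotting the resulting pairs would trace an FNMR-vs-FMR curve of the kind shown in the figures.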