A large-scale study of performance and equity of commercial remote identity verification technologies across demographics

Kaniz Fatima; Michael Schuckers; Gerardo Cruz-Ortiz; Daqing Hou; Sandip Purnapatra; Tiffany Andrews; Ambuj Neupane; Brandeis Marshall; Stephanie Schuckers

TL;DR

This paper addresses fairness in remote identity verification (RIdV) through a full end-to-end scenario evaluation of five commercial RIdV solutions on 3,991 participants, focusing on the false negative rate ($FNR$) across demographics. It introduces bootstrap-based 95% confidence bounds to test equity across race/ethnicity, gender, age, and Monk skin-tone groups, and reports both globally equitable performance and demographic-specific inequities among vendors. The key finding is that Marmot achieves an equitable, low $FNR$ across all groups (mean ~10.5% with bounds ~6–15%), while Hedgehog performs poorly and other vendors show targeted disparities (e.g., for Black/African American participants or those with darker skin tones). The study demonstrates the importance of end-to-end, demographically diverse testing for RIdV and provides a methodological framework for regulatory and procurement use, aiming to improve access to government services while reducing bias in remote identity processes.

Abstract

As more types of transactions move online, there is an increasing need to verify someone's identity remotely. Remote identity verification (RIdV) technologies have emerged to fill this need. RIdV solutions typically use a smart device to validate an identity document, such as a driver's license, and compare a face selfie to the face photo on the document. Recent research has focused on ensuring that biometric systems work fairly across demographic groups. This study assesses five commercial RIdV solutions for equity across age, gender, race/ethnicity, and skin tone using 3,991 test subjects. It employs statistical methods to determine whether RIdV results differ across demographic groups by a statistically distinguishable amount. Two of the RIdV solutions were equitable across all demographics, while two RIdV solutions had at least one demographic that was inequitable. For example, one technology had a false negative rate of 10.5% ± 4.5%, and its performance for each demographic category was within the error bounds and, hence, was equitable. The other technologies showed either poor overall performance or inequitable performance. For one of these, Black/African American (B/AA) participants as well as those with darker skin tones (Monk scale 7/8/9/10) experienced higher false rejection rates. Finally, one technology demonstrated more favorable but inequitable performance for the Asian American and Pacific Islander (AAPI) demographic. This study confirms that it is necessary to evaluate products across demographic groups to fully understand the performance of remote identity verification technologies.
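The idea of checking whether each demographic group's false negative rate falls within 95% error bounds can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a plain percentile bootstrap around the overall FNR, ignores differences in group size and multiple comparisons, and uses made-up toy data. All function names, group labels, and parameters below are hypothetical.

```python
import random

def fnr(outcomes):
    """False negative rate: share of genuine attempts that were rejected (1 = rejected)."""
    return sum(outcomes) / len(outcomes)

def bootstrap_bounds(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap bounds on the overall FNR (assumed, simplified scheme)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample subjects with replacement and recompute the FNR each time.
    stats = sorted(fnr(rng.choices(outcomes, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def flag_groups(outcomes, groups, **kwargs):
    """Label each group's FNR as 'above', 'below', or 'within' the bounds."""
    lo, hi = bootstrap_bounds(outcomes, **kwargs)
    by_group = {}
    for outcome, group in zip(outcomes, groups):
        by_group.setdefault(group, []).append(outcome)
    flags = {}
    for group, vals in by_group.items():
        rate = fnr(vals)
        flags[group] = "above" if rate > hi else "below" if rate < lo else "within"
    return (lo, hi), flags

# Toy data (fabricated for illustration only): two groups of 100 subjects,
# with 10 and 40 false rejections respectively.
outcomes = [1] * 10 + [0] * 90 + [1] * 40 + [0] * 60
groups = ["A"] * 100 + ["B"] * 100
bounds, flags = flag_groups(outcomes, groups)
print(bounds, flags)
```

In this toy setup a group labeled "above" corresponds to the pink highlighting described for the paper's figures and "below" to the blue highlighting; a vendor would count as equitable on a demographic only if every group is "within".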

Paper Structure (9 sections, 4 figures, 1 table)

Introduction
Background
Fairness Metrics and Statistical Methods
Data Collection
Statistical Methodology
Results
Discussion
Conclusion
Acknowledgments

Figures (4)

Figure 1: Typical process by which a user interacts with remote identity verification technologies on their own devices via a webpage or a software app
Figure 2: The Monk scale includes 10 color shades to describe human skin color [Skin-Lightening].
Figure 3: False negative rates (FNR) for each demographic for each vendor. Error rates that fall outside the 95% confidence bounds are highlighted in pink if they are above the bounds and in blue if they are below the bounds. The number of subjects (N) for each vendor is provided in the first row, and the number of subjects for each demographic group is provided in the first column. N differs across vendors because some subjects were removed from some vendors' data.
Figure 4: False negative rate (FNR) for vendors by race/ethnicity, gender, age, and skin tone based on the Monk scale. The upper and lower error bounds (95% confidence interval) are indicated by dark gray bars. Blue and pink symbols indicate demographic categories that are below and above the bounds, respectively.
