A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness

Oubaida Chouchane; Christoph Busch; Chiara Galdi; Nicholas Evans; Massimiliano Todisco

TL;DR

This paper addresses fairness in automatic speaker verification (ASV) by evaluating three candidate fairness metrics across multiple operating points. It compares the Fairness Discrepancy Rate (FDR), the Inequity Rate (IR), and the Gini Aggregation Rate for Biometric Equitability (GARBE), using five state-of-the-art ASV systems trained on VoxCeleb data and evaluated on a balanced, multi-national subset, following ISO/IEC DIS 19795-10 guidelines. The study finds that GARBE best satisfies the Functional Fairness Measure Criteria (FFMC), whereas FDR suffers from scale imbalances between the false match rate (FMR) and false non-match rate (FNMR) differentials, and IR is unbounded or incalculable in many cases. The evaluation of the five systems also reveals a trade-off between fairness and verification performance. The work advocates fairness-by-design in ASV development and positions GARBE as a robust, interpretable metric for biometric fairness in practical deployments.
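The behaviour summarised above can be illustrated with a minimal Python sketch. The formulas follow the definitions of FDR, IR, and GARBE as they are commonly given in the biometric fairness literature; they are not reproduced from this page, so treat them as an assumption rather than the authors' exact implementation. The function names, the weight `alpha = 0.5`, and the per-group error rates are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def max_abs_gap(rates):
    """Largest absolute difference between any two groups' error rates."""
    return max(abs(a - b) for a, b in combinations(rates, 2))


def fdr(fmr, fnmr, alpha=0.5):
    """Fairness Discrepancy Rate: 1 minus a weighted sum of the largest gaps."""
    return 1.0 - (alpha * max_abs_gap(fmr) + (1.0 - alpha) * max_abs_gap(fnmr))


def ir(fmr, fnmr, alpha=0.5):
    """Inequity Rate: weighted geometric mean of max/min ratios.

    Returns None when a group has a zero error rate, because the ratio
    is then undefined (the "incalculable" case).
    """
    if min(fmr) == 0 or min(fnmr) == 0:
        return None
    a = max(fmr) / min(fmr)
    b = max(fnmr) / min(fnmr)
    return a ** alpha * b ** (1.0 - alpha)


def gini(rates):
    """Gini coefficient with the n / (n - 1) small-sample correction.

    Needs at least two groups. Returning 0.0 when all rates are zero is a
    choice made for this sketch, not part of the published definition.
    """
    n = len(rates)
    mean = sum(rates) / n
    if mean == 0:
        return 0.0
    total = sum(abs(a - b) for a in rates for b in rates)
    return (n / (n - 1)) * total / (2 * n * n * mean)


def garbe(fmr, fnmr, alpha=0.5):
    """GARBE: weighted sum of the Gini coefficients of FMR and FNMR."""
    return alpha * gini(fmr) + (1.0 - alpha) * gini(fnmr)


# Hypothetical per-group error rates for three demographic groups.
fmr = [0.001, 0.002, 0.0005]
fnmr = [0.05, 0.10, 0.02]

print("FDR  :", fdr(fmr, fnmr))
print("IR   :", ir(fmr, fnmr))
print("GARBE:", garbe(fmr, fnmr))
print("IR with a zero FMR:", ir([0.0, 0.002, 0.0005], fnmr))
```

With these made-up rates the FMR gap (0.0015) is far smaller than the FNMR gap (0.08), so FDR is driven almost entirely by the FNMR term even though FMR varies by a factor of four across groups. IR has no upper bound and cannot be computed once any group reaches a zero error rate, while each Gini term in GARBE stays within [0, 1].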

Abstract

When decisions are made and personal data is processed by automated systems, there is an expectation of fairness: that members of different demographic groups receive equitable treatment. This expectation applies to biometric systems such as automatic speaker verification (ASV). We present a comparison of three candidate fairness metrics and extend previous work performed for face recognition by examining differential performance across a range of ASV operating points. Results show that the Gini Aggregation Rate for Biometric Equitability (GARBE) is the only metric that meets the three functional fairness measure criteria. Furthermore, we present a comprehensive evaluation of the fairness and verification performance of five state-of-the-art ASV systems. Our findings reveal a nuanced trade-off between fairness and verification accuracy, underscoring the complex interplay between system design, demographic inclusiveness, and verification reliability.

Paper Structure (23 sections, 7 equations, 7 figures, 2 tables)

Introduction
Fairness Metrics and Criteria
Fairness Discrepancy Rate
Inequity Rate
The Gini Aggregation Rate for Biometric Equitability
Functional Fairness Measure Criteria
Experimental setup
Speaker verification systems
Databases
Fairness evaluation procedure
Experimental results and discussion
Metrics evaluation results at a fixed threshold
FDR evaluation
IR evaluation
GARBE evaluation
...and 8 more sections

Figures (7)

Figure 1: FDR values using 5 automatic speaker verification systems at a threshold corresponding to FMR = 0.1%
Figure 2: FDR values using 5 automatic speaker verification systems at a range of thresholds corresponding to an FMR varying from 0.1% to 10%
Figure 3: GARBE values using 5 automatic speaker verification systems at a threshold corresponding to FMR = 0.1%
Figure 4: GARBE values using 5 automatic speaker verification systems at a range of thresholds corresponding to an FMR varying from 0.1% to 10%
Figure 5: IR values using 5 automatic speaker verification systems at a range of thresholds corresponding to an FMR varying from 0.1% to 10%
...and 2 more figures
