As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research

Wiebke Hutiri; Tanvina Patel; Aaron Yi Ding; Odette Scharenborg


TL;DR

The paper addresses how bias evaluations in speaker verification depend on metric choices, leading to inconsistent conclusions across studies. It introduces a formal framework with base metrics such as $EER$ and $minCDet$, bias measures such as $G2min Diff$, $G2avg Ratio$, and $G2avg log Ratio$, and meta-measures such as $FDR$ and $NRB$, and applies it to a ResNet-34 SV model evaluated on VoxCeleb1-H/I across gender and nationality groups. The results show that the magnitude of base metrics can flip interpretations for difference-based measures, while ratio-based bias measures are magnitude-invariant, and that the two meta-measures can diverge, with $NRB$ capturing risk in low-$FPR$ regimes. The authors recommend adopting ratio-based bias measures and the $NRB$ meta-measure to enable fairer, more interpretable bias assessments and to guide safer deployment of speaker verification systems.

Abstract

Detecting and mitigating bias in speaker verification systems is important, as datasets, processing choices and algorithms can lead to performance differences that systematically favour some groups of people while disadvantaging others. Prior studies have thus measured performance differences across groups to evaluate bias. However, when comparing results across studies, it becomes apparent that they draw contradictory conclusions, hindering progress in this area. In this paper we investigate how measurement impacts the outcomes of bias evaluations. We show empirically that bias evaluations are strongly influenced by base metrics that measure performance, by the choice of ratio or difference-based bias measure, and by the aggregation of bias measures into meta-measures. Based on our findings, we recommend the use of ratio-based bias measures, in particular when the values of base metrics are small, or when base metrics with different orders of magnitude need to be compared.
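The contrast between difference-based and ratio-based measures can be illustrated with a small sketch. The function names and the error rates below are hypothetical, chosen only for illustration; they are not the paper's results, and the paper's exact definitions of its bias measures may differ in which reference value (for example the minimum or the average across groups) they compare against.

```python
import math


def diff_to_reference(group_rate: float, reference_rate: float) -> float:
    """Difference-based bias: group error rate minus a reference rate."""
    return group_rate - reference_rate


def ratio_to_reference(group_rate: float, reference_rate: float) -> float:
    """Ratio-based bias: group error rate divided by a reference rate."""
    return group_rate / reference_rate


def log_ratio_to_reference(group_rate: float, reference_rate: float) -> float:
    """Log-ratio bias: symmetric around 0 for better/worse than reference."""
    return math.log(group_rate / reference_rate)


# Hypothetical (group rate, reference rate) pairs at two orders of magnitude.
# In both regimes the group errs twice as often as the reference.
low_regime = (0.002, 0.001)
high_regime = (0.2, 0.1)

for name, (group, ref) in {"low": low_regime, "high": high_regime}.items():
    print(
        name,
        "diff:", diff_to_reference(group, ref),
        "ratio:", ratio_to_reference(group, ref),
        "log ratio:", log_ratio_to_reference(group, ref),
    )
```

In this sketch the difference changes by two orders of magnitude between the regimes even though the relative disadvantage is identical, while the ratio and log ratio stay the same. This is the sense in which ratio-based measures are easier to compare when base metrics are small or differ in magnitude.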


Paper Structure (11 sections, 2 equations, 2 figures, 4 tables)


Introduction
Background and Related Work
Method
Bias and Meta-measures
Experiment Setup
Results
Impact of Base Metrics and Bias Measures
Impact of Meta-measures
Which meta-measure is correct?
Discussion and Limitations
Conclusion

Figures (2)

Figure 1: FDR meta-measure for gender + nationality groups. The FDR is calculated for different $\alpha$ and for systems calibrated to thresholds that produce pre-determined $FPR_{avg}$. $\alpha=0$ only considers the FNR, while $\alpha=1$ only considers the FPR.
Figure 2: NRB meta-measure for gender + nationality groups. The meta-measure is calculated for different base metrics.
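The role of $\alpha$ in the Figure 1 caption can be made concrete with a minimal sketch. It assumes that FDR denotes the Fairness Discrepancy Rate in its commonly cited form, $FDR = 1 - (\alpha A + (1 - \alpha) B)$, where $A$ and $B$ are the largest pairwise gaps in FPR and FNR across groups at a fixed threshold. That formula is not stated in this text, so treat it as an assumption; the function name, group labels, and error rates are hypothetical.

```python
from itertools import combinations


def fairness_discrepancy_rate(
    fpr_by_group: dict[str, float],
    fnr_by_group: dict[str, float],
    alpha: float,
) -> float:
    """Assumed form: FDR = 1 - (alpha * A + (1 - alpha) * B).

    A is the largest pairwise FPR gap between groups, B the largest
    pairwise FNR gap, both measured at one shared decision threshold.
    """
    pairs = list(combinations(fpr_by_group, 2))
    max_fpr_gap = max(abs(fpr_by_group[a] - fpr_by_group[b]) for a, b in pairs)
    max_fnr_gap = max(abs(fnr_by_group[a] - fnr_by_group[b]) for a, b in pairs)
    return 1.0 - (alpha * max_fpr_gap + (1.0 - alpha) * max_fnr_gap)


# Hypothetical per-group error rates at a single threshold.
fpr = {"group_a": 0.010, "group_b": 0.014, "group_c": 0.008}
fnr = {"group_a": 0.050, "group_b": 0.090, "group_c": 0.060}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, fairness_discrepancy_rate(fpr, fnr, alpha))
```

With $\alpha = 0$ only the FNR gap contributes and with $\alpha = 1$ only the FPR gap contributes, matching the caption. Because the gaps are differences, the score sits close to 1 whenever the underlying error rates are small, whatever the relative disparity between groups.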
