As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research

Wiebke Hutiri, Tanvina Patel, Aaron Yi Ding, Odette Scharenborg

TL;DR

The paper addresses how bias evaluations in speaker verification depend on metric choices, leading to inconsistent conclusions across studies. It introduces a formal framework with base metrics such as $EER$ and $minCDet$, bias measures such as $G2min Diff$, $G2avg Ratio$, and $G2avg log Ratio$, and meta-measures such as $FDR$ and $NRB$, and applies it to a ResNet-34 SV model evaluated on VoxCeleb1-H/I across gender and nationality groups. The results show that the magnitude of base metrics can flip interpretations for difference-based measures, while ratio-based bias measures are magnitude-invariant, and that the two meta-measures can diverge, with $NRB$ capturing risk in low-$FPR$ regimes. The authors recommend adopting ratio-based bias measures and the $NRB$ meta-measure to enable fairer, more interpretable bias assessments and to guide safer deployment of speaker verification systems.

Abstract

Detecting and mitigating bias in speaker verification systems is important, as datasets, processing choices and algorithms can lead to performance differences that systematically favour some groups of people while disadvantaging others. Prior studies have thus measured performance differences across groups to evaluate bias. However, when comparing results across studies, it becomes apparent that they draw contradictory conclusions, hindering progress in this area. In this paper we investigate how measurement impacts the outcomes of bias evaluations. We show empirically that bias evaluations are strongly influenced by base metrics that measure performance, by the choice of ratio or difference-based bias measure, and by the aggregation of bias measures into meta-measures. Based on our findings, we recommend the use of ratio-based bias measures, in particular when the values of base metrics are small, or when base metrics with different orders of magnitude need to be compared.
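To make the recommendation about ratio-based measures concrete, here is a minimal sketch comparing a difference, a ratio, and a log ratio when the same relative gap occurs at two different magnitudes of a base metric. The function names, the choice of reference value, and the error rates are illustrative assumptions; they are not the paper's exact definitions or results.

```python
import math


def diff_bias(group_value, reference_value):
    """Difference-based bias: absolute gap between a group and a reference."""
    return group_value - reference_value


def ratio_bias(group_value, reference_value):
    """Ratio-based bias: how many times larger the group's value is."""
    return group_value / reference_value


def log_ratio_bias(group_value, reference_value):
    """Log of the ratio: symmetric around 0 when group and reference swap."""
    return math.log(group_value / reference_value)


# Hypothetical error rates (group, reference), chosen for illustration only.
# In both cases the group's error rate is twice the reference value.
cases = {
    "large base metric": (0.10, 0.05),
    "small base metric": (0.002, 0.001),
}

for name, (group, reference) in cases.items():
    print(
        f"{name}: "
        f"diff={diff_bias(group, reference):.4f}, "
        f"ratio={ratio_bias(group, reference):.2f}, "
        f"log ratio={log_ratio_bias(group, reference):.3f}"
    )
```

In this toy setup the difference shrinks by a factor of 50 between the two cases even though the group is twice as error-prone in both, whereas the ratio and log ratio are unchanged. This is the sense in which a difference-based measure can suggest that bias is negligible simply because the base metric is small.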

Paper Structure (11 sections, 2 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: FDR meta-measure for gender + nationality groups. The FDR is calculated for different $\alpha$ and for systems calibrated to thresholds that produce pre-determined $FPR_{avg}$. $\alpha=0$ only considers the FNR, while $\alpha=1$ only considers the FPR.
  • Figure 2: NRB meta-measure for gender + nationality groups. The meta-measure is calculated for different base metrics.
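
The role of $\alpha$ in the Figure 1 caption can be sketched in code. The example below assumes that FDR refers to the Fairness Discrepancy Rate in its commonly cited form, which combines the largest pairwise FPR gap and the largest pairwise FNR gap across groups at a shared threshold. The definition is not given in this excerpt, so the formula, the function name `fdr`, and the group error rates are assumptions for illustration.

```python
from itertools import combinations


def fdr(fpr_by_group, fnr_by_group, alpha):
    """Assumed Fairness Discrepancy Rate: 1 - (alpha * A + (1 - alpha) * B).

    A is the largest pairwise FPR gap between groups and B is the largest
    pairwise FNR gap, both measured at the same decision threshold.
    A value of 1.0 means no discrepancy between groups.
    """
    pairs = list(combinations(fpr_by_group, 2))
    max_fpr_gap = max(abs(fpr_by_group[a] - fpr_by_group[b]) for a, b in pairs)
    max_fnr_gap = max(abs(fnr_by_group[a] - fnr_by_group[b]) for a, b in pairs)
    return 1.0 - (alpha * max_fpr_gap + (1.0 - alpha) * max_fnr_gap)


# Hypothetical per-group error rates at one threshold (not from the paper).
fpr = {"group_a": 0.01, "group_b": 0.03}
fnr = {"group_a": 0.10, "group_b": 0.20}

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: FDR={fdr(fpr, fnr, alpha):.3f}")
```

Under this assumed formulation, $\alpha=0$ reduces the measure to the FNR gap alone and $\alpha=1$ to the FPR gap alone, matching the caption. Because the gaps are differences, the measure moves closer to 1 as the underlying error rates shrink, even if the relative gap between groups stays the same.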