Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants

Chloé Sekkat, Fanny Leroy, Salima Mdhaffar, Blake Perry Smith, Yannick Estève, Joseph Dureau, Alice Coucke

TL;DR

The paper addresses demographic bias in voice assistants by introducing the Sonos Voice Control Bias Assessment Dataset, a large, demographically tagged corpus for music-domain commands, and by proposing a bias analysis framework that leverages SLU-level Exact Match metrics rather than transcription accuracy. It demonstrates the approach with state-of-the-art ASR (wav2vec2.0) and SLU (JointBERT) models, reporting statistically significant biases across age, dialectal region, and ethnicity, with multivariate analyses revealing complex interactions between dialect, gender, and age. By open-sourcing both the dataset and the statistical methodology, the work provides a concrete benchmark and toolkit to detect, quantify, and interpret demographic bias in end-to-end voice assistant systems. The findings highlight the importance of considering SLU performance and cross-demographic interactions for building more robust and inclusive voice interfaces in the music-domain use case, and they point to future work on broader demographic coverage and more challenging acoustic conditions.

Abstract

Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.
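To make the metric concrete, here is a minimal sketch of how an Exact Match Ratio (EMR) could be computed per speaker and then summarized per demographic group. It is an illustration only, not the authors' released code: the sample layout (`speaker`, `pred`, `ref`, each with an `intent` and a `slots` mapping), the function names, the age-group labels, and the choice to average per-speaker EMR within a group are all assumptions made for this example.

```python
from collections import defaultdict

def exact_match(pred, ref):
    """Hypothetical criterion: intent and all slots must equal the reference."""
    return pred["intent"] == ref["intent"] and pred["slots"] == ref["slots"]

def emr_per_speaker(samples):
    """Fraction of exactly matched utterances for each speaker."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["speaker"]] += 1
        hits[s["speaker"]] += int(exact_match(s["pred"], s["ref"]))
    return {spk: hits[spk] / totals[spk] for spk in totals}

def emr_per_group(speaker_emr, speaker_group):
    """Mean of per-speaker EMR within each demographic group (an assumption)."""
    by_group = defaultdict(list)
    for spk, emr in speaker_emr.items():
        by_group[speaker_group[spk]].append(emr)
    return {g: sum(v) / len(v) for g, v in by_group.items()}

# Toy data, invented for illustration only.
play_jazz = {"intent": "PlayMusic", "slots": {"genre": "jazz"}}
play_rock = {"intent": "PlayMusic", "slots": {"genre": "rock"}}
samples = [
    {"speaker": "A", "pred": play_jazz, "ref": play_jazz},
    {"speaker": "A", "pred": play_rock, "ref": play_rock},
    {"speaker": "B", "pred": play_jazz, "ref": play_jazz},
    {"speaker": "B", "pred": play_jazz, "ref": play_rock},  # wrong slot value
]
speaker_group = {"A": "18-30", "B": "60+"}

speaker_emr = emr_per_speaker(samples)
group_emr = emr_per_group(speaker_emr, speaker_group)
```

Unlike word error rate, this criterion ignores transcription mistakes that leave the parsed intent and slots unchanged, and penalizes any mistake that changes them, which is why it tracks what the user actually experiences more closely.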

Paper Structure (41 sections, 5 equations, 11 figures, 6 tables)

This paper contains 41 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Audio sample distribution in the test split of the dataset in terms of age, gender, and dialectal region. The number of samples in each group is displayed under the group label.
  • Figure 2: Exact Match Ratio (EMR) per speaker's demographic group. Points indicate individual speakers.
  • Figure 3: Interaction effect of age and dialectal region on the Exact Match Ratio (EMR). When the data is split across dialectal regions, the differences in EMR between age groups widen compared to Figure 2(a).
  • Figure 4: Speaker distribution in the test split of the dataset in terms of age, gender, and dialectal region. The number of speakers in each group is displayed under the group label.
  • Figure 5: Audio sample distribution in the train split of the dataset in terms of age, gender, and dialectal region. The number of samples in each group is displayed under the group label.
  • ...and 6 more figures