Language Bias in Self-Supervised Learning For Automatic Speech Recognition

Edward Storey, Naomi Harte, Peter Bell

TL;DR

The paper investigates language bias in multilingual SSL ASR (XLS-R) by applying the Lottery Ticket Hypothesis to identify language-specific subnetworks. It demonstrates that, because the pretraining data is imbalanced and dominated by English, fine-tuning tends to rely on weights from the language with the largest data contribution, often English, across downstream languages. Through pruning experiments at 70%, 80%, and 90% sparsity and intersection-over-union (IoU) analyses, the authors show that English-derived subnetworks generally yield the lowest character error rate (CER) and that non-English subnetworks contribute less effectively, even when the downstream language is linguistically related. The findings highlight the importance of balancing pretraining data by language and linguistic relationships to mitigate bias and improve truly multilingual SSL ASR performance in open-source models.

Abstract

Self-supervised learning (SSL) is used in deep learning to train on large datasets without the need for expensive labelling of the data. Recently, large Automatic Speech Recognition (ASR) models such as XLS-R have utilised SSL to train on over one hundred different languages simultaneously. However, deeper investigation shows that the bulk of the training data for XLS-R comes from a small number of languages. Biases learned through SSL have been shown to exist in multiple domains, but language bias in multilingual SSL ASR has not been thoroughly examined. In this paper, we utilise the Lottery Ticket Hypothesis (LTH) to identify language-specific subnetworks within XLS-R and test the performance of these subnetworks on a variety of different languages. We are able to show that when fine-tuning, XLS-R bypasses traditional linguistic knowledge and builds only on weights learned from the languages with the largest data contribution to the pretraining data.
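To make the idea of comparing language-specific subnetworks concrete, the following is a minimal sketch of how two pruning masks could be built and compared with IoU. It assumes one-shot global magnitude pruning over a flat list of weights, which is a simplification and not necessarily the procedure used in the paper. The function names `magnitude_mask` and `mask_iou` and the toy weight values are hypothetical and chosen only for illustration.

```python
from typing import List


def magnitude_mask(weights: List[float], sparsity: float) -> List[bool]:
    """Return a keep-mask that prunes the `sparsity` fraction of weights
    with the smallest absolute value (assumed one-shot magnitude pruning)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(weights))]


def mask_iou(mask_a: List[bool], mask_b: List[bool]) -> float:
    """Intersection over union of the kept positions in two masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as identical.
    return inter / union if union else 1.0


# Toy weights standing in for two models fine-tuned to different languages.
weights_en = [0.9, -0.1, 0.05, 0.8, -0.7, 0.02, 0.3, -0.01, 0.2, 0.04]
weights_es = [0.85, 0.6, -0.03, 0.75, 0.1, 0.02, -0.2, 0.01, 0.05, 0.3]

mask_en = magnitude_mask(weights_en, sparsity=0.7)
mask_es = magnitude_mask(weights_es, sparsity=0.7)
print(mask_iou(mask_en, mask_es))
```

A high IoU between masks derived from different upstream languages would indicate that the surviving weights largely overlap, which is the kind of evidence the bias argument relies on.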

Paper Structure

This paper contains 16 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Training pipeline for all models. XLS-R is fine-tuned to an upstream language. We then prune and train on the downstream task for 10 epochs. If the upstream and downstream languages do not match, we freeze the encoder and train for 1 extra epoch before unfreezing the encoder and training for 10 epochs.
  • Figure 2: Upstream English to multiple downstream languages. An English upstream model is fine-tuned to downstream English, French, German, Polish and Spanish while pruning from 0% up to 90% sparsity.
  • Figure 3: Upstream Spanish to five downstream languages. The Spanish upstream model is fine-tuned to downstream English, French, German, Polish and Spanish at 70%, 80% and 90% sparsity.
  • Figure 4: Upstream Polish to five downstream languages. The Polish upstream model is fine-tuned to downstream English, French, German, Polish and Spanish at 70%, 80% and 90% sparsity.
  • Figure 5: Mean average results for each language-specific subnetwork when tested on all other downstream languages. Each subnetwork is fine-tuned to the four other languages; these results are then averaged and plotted at 70%, 80% and 90% sparsity.
  • ...and 3 more figures
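
The downstream epoch schedule described in the Figure 1 caption can be sketched as a small planning function. This is only an illustration of the caption's wording: the function name `downstream_schedule`, the phase labels, and the language codes are hypothetical, and the sketch does not model when pruning is applied within the 10 epochs.

```python
from typing import List, Tuple


def downstream_schedule(upstream: str, downstream: str) -> List[Tuple[str, int]]:
    """Return (phase, epochs) pairs for the downstream stage.

    Matching languages: train the whole model for 10 epochs.
    Mismatched languages: 1 extra epoch with the encoder frozen,
    then 10 epochs with the encoder unfrozen.
    """
    if upstream == downstream:
        return [("train_all", 10)]
    return [("train_encoder_frozen", 1), ("train_all", 10)]


# Example: an English upstream model fine-tuned to downstream Polish.
for phase, epochs in downstream_schedule("en", "pl"):
    print(phase, epochs)
```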