DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation

Suraj Kothawade, Anmol Mekala, Chandra Sekhara D, Mayank Kothyari, Rishabh Iyer, Ganesh Ramakrishnan, Preethi Jyothi

TL;DR

This work tackles ASR performance disparities across accents when labeled data is limited by introducing DITTO, a data-efficient and fair targeted subset selection framework. DITTO uses submodular mutual information (SMI) to select a representative, target-relevant subset S from a large unlabeled pool U: it maximizes I_f(S;T), where T is a small target set, subject to a budget B on total transcription time. The method supports single- and multi-accent targeting, with the FLMI and GCMI functions offering different trade-offs between relevance and diversity, and it is 3-5x more label-efficient than baselines on IndicTTS and L2-Arctic. Empirically, SMI-based selection yields larger WER reductions than the baselines, and FLMI generally gives fairer selections across accents, which matters in practice when deploying ASR systems in multi-accent settings.
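
The selection problem summarized above can be written as a budgeted maximization. The following is a sketch under stated assumptions: f denotes the underlying submodular function, the definition of I_f is the standard one for submodular mutual information, and d(u) is a hypothetical symbol for the transcription time of utterance u (the summary does not name it).

```latex
\max_{S \subseteq U} \; I_f(S; T)
\quad \text{subject to} \quad \sum_{u \in S} d(u) \le B,
\qquad \text{where} \quad
I_f(S; T) = f(S) + f(T) - f(S \cup T)
```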

Abstract

State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents therefore becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e., it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that DITTO is 3-5 times more label-efficient than other speech selection methods on the IndicTTS and L2-Arctic datasets.
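
To make the idea of SMI-based selection under a fixed budget concrete, here is a minimal Python sketch. It is not the paper's Algorithm 1, which is not reproduced in this excerpt. The FLMI and GCMI expressions follow the forms commonly given in the SMI literature and should be treated as assumptions here. The function names (`flmi`, `gcmi`, `greedy_select`), the parameters `eta` and `lam`, the gain-per-second greedy rule, and the toy similarity matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def flmi(sim, selected, eta=1.0):
    """Assumed FLMI form:
    sum_t max_s sim[s, t] + eta * sum_s max_t sim[s, t],
    with s ranging over `selected` and t over the target set.
    sim has shape (n_pool, n_target) and non-negative entries."""
    if not selected:
        return 0.0
    sub = sim[selected]  # rows of the selected pool items
    return float(sub.max(axis=0).sum() + eta * sub.max(axis=1).sum())

def gcmi(sim, selected, lam=0.5):
    """Assumed GCMI form: 2 * lam * sum of all selected-to-target similarities."""
    if not selected:
        return 0.0
    return float(2.0 * lam * sim[selected].sum())

def greedy_select(sim, durations, budget, score_fn):
    """Greedy selection under a total-duration budget.
    Each step adds the affordable item with the largest marginal gain
    per second (a common knapsack heuristic, assumed here)."""
    selected, spent, current = [], 0.0, 0.0
    remaining = set(range(sim.shape[0]))
    while True:
        best, best_gain = None, 0.0
        for i in sorted(remaining):
            if spent + durations[i] > budget:
                continue  # item does not fit in the remaining budget
            gain = (score_fn(sim, selected + [i]) - current) / durations[i]
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nothing affordable improves the score
        selected.append(best)
        spent += float(durations[best])
        current = score_fn(sim, selected)
        remaining.discard(best)
    return selected, spent

# Toy data: 4 pool utterances, 2 target utterances, 2 seconds each.
toy_sim = np.array([[0.90, 0.10],
                    [0.10, 0.80],
                    [0.80, 0.15],
                    [0.05, 0.05]])
toy_durations = np.array([2.0, 2.0, 2.0, 2.0])

print(greedy_select(toy_sim, toy_durations, 4.0, flmi))
print(greedy_select(toy_sim, toy_durations, 4.0, gcmi))
```

On this toy matrix, the FLMI variant picks items 0 and 1, one close to each target utterance, while the GCMI variant picks items 0 and 2, both close to the first target utterance. This is only a small-scale analogue of the relevance versus diversity trade-off described above, not a reproduction of the paper's results.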

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: ASR Accent Adaptation using DITTO.
  • Figure 2: WER reductions across a range of budgets, targeting Assamese (from IndicTTS) and Chinese (from L2-Arctic) accents.
  • Figure 3: Variation in selections and average WER across a range of budgets, targeting ASM and MAL together.
  • Figure 4: Variation in selections and average WER across a range of budgets, targeting CHN and VTN together.
  • Figure 5: t-SNE visualising MFCC features of selections on the full IndicTTS dataset: comparing FLMI and GCMI. FLMI’s selections are representative of the query and spread over all clusters A, B, C and D. GCMI is biased towards the bigger cluster centers: it does not select from clusters B and C at all (thus selecting a lower Assamese fraction) and selects dense, unrepresentative clusters from the centers of A and D.