CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

Martijn Bartelds; Ananjan Nandi; Moussa Koulako Bala Doumbouya; Dan Jurafsky; Tatsunori Hashimoto; Karen Livescu

TL;DR

CTC-DRO targets language disparities in multilingual ASR, where group DRO fails because CTC losses scale with input length and include irreducible components that differ across languages. It introduces length-matched batching and a smoothed maximization objective for the group weights, giving a generalized DRO formulation that remains effective when group losses are not directly comparable. On ML-SUPERB 2.0 across five language sets, CTC-DRO reduces the worst-language CER by up to 47.1% and the average CER by up to 32.9%, with minimal computational overhead. The approach may also apply to other sequence tasks with variable-length inputs where standard group DRO is brittle.

Abstract

Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
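
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. `group_dro_update` is the standard exponentiated-gradient weight update from group DRO, which CTC-DRO modifies; the exact form of the smoothed update is not given in this section, so it is not reproduced here. `length_matched_batches` is a hypothetical helper that assumes each batch holds utterances from a single language and is packed greedily to roughly the same total audio duration; the function names, the `target_seconds` parameter, and the toy losses are all illustrative assumptions.

```python
import math


def group_dro_update(weights, group_losses, step_size):
    """Standard group DRO step: multiply each group weight by
    exp(step_size * loss), then renormalize so the weights sum to 1."""
    scaled = {g: weights[g] * math.exp(step_size * group_losses[g])
              for g in weights}
    total = sum(scaled.values())
    return {g: value / total for g, value in scaled.items()}


def length_matched_batches(utterances, target_seconds):
    """Hypothetical batching sketch: group (language, duration) pairs by
    language and greedily pack each batch until its total duration
    reaches target_seconds. The last batch per language may be shorter."""
    by_group = {}
    for group, duration in utterances:
        by_group.setdefault(group, []).append(duration)

    batches = []
    for group, durations in by_group.items():
        current, total = [], 0.0
        for duration in durations:
            current.append(duration)
            total += duration
            if total >= target_seconds:
                batches.append((group, current))
                current, total = [], 0.0
        if current:  # leftover utterances form a final, shorter batch
            batches.append((group, current))
    return batches


# Toy illustration (made-up numbers): language "a" has a constant loss
# of 2.0 and language "b" a constant loss of 1.0, e.g. because "a" has
# longer inputs rather than because the model is worse on it.
weights = {"a": 0.5, "b": 0.5}
for _ in range(10):
    weights = group_dro_update(weights, {"a": 2.0, "b": 1.0}, step_size=0.5)

batches = length_matched_batches(
    [("x", 4.0), ("x", 6.0), ("x", 3.0), ("y", 10.0)], target_seconds=10.0
)
```

In the toy run, nearly all of the weight ends up on the group whose loss is consistently higher, whether or not that gap reflects a real performance difference. This is the overemphasis that the abstract says the smoothed update is designed to prevent, while batching to a matched total duration addresses the dependence of the CTC loss on input length.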
