Houston we have a Divergence: A Subgroup Performance Analysis of ASR Models
Alkis Koudounas, Flavio Giobergia
TL;DR
The paper analyzes why Apollo-era multi-speaker recordings exhibit ASR performance variability across subgroups by constructing a metadata-driven framework that groups recordings and measures WER divergence from the population. It evaluates Whisper variants in zero-shot and fine-tuned settings across English-only and multilingual configurations, linking subgroup characteristics to performance with Global Shapley Values. Key findings show that fine-tuning reduces subgroup divergence and that model-size effects are heterogeneous, while multilingual models offer selective gains for a subset of subgroups; metadata such as high SNR and low spectral flatness correlate with better WER. The work provides a practical methodology for diagnosing and mitigating subgroup disparities in Earth-to-space communications and informs targeted optimizations for historical multi-speaker ASR datasets.
Abstract
The Fearless Steps APOLLO Community Resource provides unparalleled opportunities to explore the potential of multi-speaker team communications from NASA Apollo missions. This study focuses on discovering the characteristics that make Apollo recordings more or less intelligible to Automatic Speech Recognition (ASR) methods. We extract, for each audio recording, interpretable metadata derived from the recording (signal-to-noise ratio, spectral flatness, presence of pauses, sentence duration), derived from the transcript (number of words spoken, speaking rate), or known a priori (speaker). We identify subgroups of audio recordings based on combinations of these metadata and compute each subgroup's performance (e.g., Word Error Rate) and the difference in performance ("divergence") with respect to the overall population. We then apply the Whisper model in different sizes, trained on English-only or multilingual datasets, in a zero-shot setting or after fine-tuning. We conduct several analyses to (i) automatically identify and describe the most problematic subgroups for a given model, (ii) examine the impact of fine-tuning with respect to zero-shot at the subgroup level, (iii) understand the effect of model size on subgroup performance, and (iv) analyze whether multilingual models are more sensitive than monolingual ones to subgroup performance disparities. The insights enhance our understanding of subgroup-specific performance variations, paving the way for advancements in optimizing ASR systems for Earth-to-space communications.
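The subgroup divergence described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each metadata value has already been discretized into a label (for example an SNR bucket), that WER is pooled over all words in a group rather than averaged per utterance, and that divergence is the subgroup WER minus the overall WER. The function names, the record fields (`ref`, `hyp`, `snr`), and the toy transcripts are all hypothetical.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(records):
    """Pooled WER: total word errors divided by total reference words."""
    errors = words = 0
    for rec in records:
        e, n = word_errors(rec["ref"], rec["hyp"])
        errors += e
        words += n
    return errors / words


def subgroup_divergence(records, keys):
    """Map each combination of metadata values to (subgroup WER - overall WER)."""
    overall = wer(records)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)
    return {group: wer(members) - overall for group, members in groups.items()}


# Toy data (assumed, not from the paper): two SNR buckets.
records = [
    {"ref": "go for launch", "hyp": "go for launch", "snr": "high"},
    {"ref": "roger that houston", "hyp": "roger houston", "snr": "low"},
    {"ref": "we copy", "hyp": "we copy", "snr": "high"},
    {"ref": "stand by one", "hyp": "stand bye won", "snr": "low"},
]
divergence = subgroup_divergence(records, keys=["snr"])
```

Under these assumptions, a positive divergence marks a subgroup that the model transcribes worse than the population as a whole, and a negative value marks one it transcribes better. Passing several keys (for example `["snr", "speaker"]`) groups recordings by combinations of metadata values.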
