Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels
Siyu Liang, Nicolas Ballier, Gina-Anne Levow, Richard Wright
TL;DR
The paper addresses fairness and efficacy gaps in multilingual ASR by moving beyond aggregate Word Error Rate (WER) and probing Whisper's sub-token decoder. It retraces the final beam path to record, at each decoding step, the top-$K_{cand}$ sub-token candidates and their probabilities across 20 languages spanning a range of resource levels, and it analyzes metrics such as average rank, confidence, entropy, and alternative-candidate diversity. Higher-resource languages show stronger local decoding signals and more diverse alternative hypotheses, while lower-resource languages score worse on these metrics. PCA and t-SNE on sub-token usage further show lower-resource languages forming distinct clusters that are sometimes influenced by typology. The work points to practical interventions, including language-specific adapters and decoding adjustments, to address decoder-level biases and make multilingual ASR performance more equitable. Overall, the study argues that internal decoding dynamics offer actionable insight into disparities that WER alone does not capture.
Abstract
While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper's multilingual decoder, examining its sub-token hypotheses during transcription across languages at different resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher-resource languages benefit from a higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower-resource languages fare worse on these metrics; in our PCA and t-SNE analyses, they also exhibit distinct clustering patterns in sub-token usage that are sometimes influenced by typology. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
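
To make the per-step metrics concrete, here is a minimal sketch of how rank, confidence, and entropy could be computed once the top-K candidates for each decoding step have been recorded. It is an illustration under stated assumptions, not the authors' implementation: the names `step_metrics` and `summarize` are hypothetical, "confidence" is taken to be the probability assigned to the reference sub-token, and entropy is computed over the top-K probabilities after renormalizing them to sum to 1.

```python
import math
from typing import Dict, Sequence, Tuple

Candidate = Tuple[str, float]  # (sub-token, probability); hypothetical data layout


def step_metrics(reference: str, candidates: Sequence[Candidate]) -> Dict[str, float]:
    """Rank, confidence, and entropy for one decoding step (illustrative definitions)."""
    # Order candidates from most to least probable.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    tokens = [tok for tok, _ in ranked]
    if reference not in tokens:
        raise ValueError(f"{reference!r} is not among the top-K candidates")
    total = sum(p for _, p in ranked)
    if total <= 0:
        raise ValueError("candidate probabilities must sum to a positive value")
    idx = tokens.index(reference)
    # Assumption: entropy over the top-K probabilities, renormalized to sum to 1.
    entropy = -sum((p / total) * math.log(p / total) for _, p in ranked if p > 0)
    # Assumption: "confidence" is the probability of the reference sub-token.
    return {"rank": idx + 1, "confidence": ranked[idx][1], "entropy": entropy}


def summarize(steps: Sequence[Tuple[str, Sequence[Candidate]]]) -> Dict[str, float]:
    """Average the per-step metrics over one utterance or one language."""
    if not steps:
        raise ValueError("need at least one decoding step")
    per_step = [step_metrics(ref, cands) for ref, cands in steps]
    n = len(per_step)
    return {
        "mean_rank": sum(m["rank"] for m in per_step) / n,
        "top1_rate": sum(m["rank"] == 1 for m in per_step) / n,
        "mean_confidence": sum(m["confidence"] for m in per_step) / n,
        "mean_entropy": sum(m["entropy"] for m in per_step) / n,
    }


if __name__ == "__main__":
    # Toy steps with made-up probabilities, for illustration only.
    toy_steps = [
        ("the", [("the", 0.5), ("a", 0.25), ("an", 0.25)]),
        ("cat", [("cap", 0.6), ("cat", 0.3), ("can", 0.1)]),
    ]
    print(summarize(toy_steps))
```

Comparing these averages across languages gives the kind of decoder-level view the abstract describes: a lower mean rank and a higher top-1 rate indicate that the reference sub-token is usually the decoder's first choice, while lower entropy indicates a more peaked candidate distribution.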
