Cross-utterance ASR Rescoring with Graph-based Label Propagation
Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran
TL;DR
This paper tackles majoritarian bias in ASR by moving beyond single-utterance rescoring and leveraging cross-utterance acoustic similarity. It introduces a graph-based label propagation framework that operates on a finite label set derived from the union of $N$-best hypotheses and uses a DTW-based distance $d\text{-}\mathrm{DTW}_{\mathrm{norm}}$ over RNN-T encoder embeddings to form utterance graphs. Soft, probabilistic labels initialized from the baseline model propagate through the graph to jointly rescore across utterances, enabling label sharing across similar utterances and potentially recovering hypotheses outside the initial $N$-best list. Experiments on LibriSpeech and VCTK demonstrate improved WER and SER across accents, highlighting the method’s ability to mitigate biases and enhance offline ASR performance without domain-specific model adaptation.
Abstract
We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.
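To make the idea of collaborative rescoring concrete, the following is a minimal sketch of label propagation over an utterance graph. It is not the authors' implementation. This section does not specify the exact distance, graph construction, or update rule, so the sketch assumes a length-normalized DTW over per-frame embeddings, a Gaussian kernel for edge weights, and the standard iterative update that mixes neighbor scores with the baseline scores. The function names and the parameters `sigma`, `alpha`, and `n_iter` are hypothetical.

```python
import numpy as np


def dtw_distance(a, b):
    """Length-normalized DTW distance between two embedding sequences.

    a: array of shape (T_a, D), b: array of shape (T_b, D).
    The normalization by (T_a + T_b) is an assumption of this sketch.
    """
    ta, tb = len(a), len(b)
    # Pairwise Euclidean cost between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[ta, tb] / (ta + tb)


def propagate_labels(dist, init_scores, sigma=1.0, alpha=0.5, n_iter=50):
    """Propagate soft labels across utterances.

    dist: (U, U) symmetric matrix of pairwise utterance distances.
    init_scores: (U, L) baseline posteriors over a shared label set
        (for example, the union of all N-best hypotheses).
    Returns a (U, L) matrix of rescored posteriors, one row per utterance.
    """
    # Gaussian kernel turns distances into edge weights (assumed choice).
    w = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # no self-loops
    # Row-normalize so each utterance averages over its neighbors.
    p = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
    y = init_scores.copy()
    for _ in range(n_iter):
        # Mix neighbor scores with the baseline scores at every step.
        y = alpha * (p @ y) + (1.0 - alpha) * init_scores
    return y / y.sum(axis=1, keepdims=True)


# Toy example: utterances 0 and 1 are acoustically close, utterance 2 is far.
dist = np.array([[0.0, 0.1, 5.0],
                 [0.1, 0.0, 5.0],
                 [5.0, 5.0, 0.0]])
# Baseline posteriors over two candidate hypotheses.
init_scores = np.array([[0.90, 0.10],
                        [0.45, 0.55],
                        [0.10, 0.90]])
rescored = propagate_labels(dist, init_scores)
best = rescored.argmax(axis=1)
```

In this toy setup, utterance 1 starts with a slight preference for the second hypothesis, but its close neighbor is confident in the first, so propagation shifts utterance 1 toward the first hypothesis. Because the label set is shared, an utterance can also end up with a hypothesis that was absent from its own N-best list.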
