Cross-utterance ASR Rescoring with Graph-based Label Propagation

Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran

TL;DR

This paper tackles majoritarian bias in ASR by moving beyond single-utterance rescoring and leveraging cross-utterance acoustic similarity. It introduces a graph-based label propagation framework that operates on a finite label set derived from the union of $N$-best hypotheses and uses a DTW-based distance, $d\text{-}\mathrm{DTW}_{\mathrm{norm}}$, over RNN-T encoder embeddings to form utterance graphs. Soft, probabilistic labels initialized from the baseline model propagate through the graph to jointly rescore across utterances, enabling label sharing across similar utterances and potentially recovering hypotheses outside the initial $N$-best list. Experiments on LibriSpeech and VCTK demonstrate improved WER and SER across accents, highlighting the method’s ability to mitigate biases and enhance offline ASR performance without domain-specific model adaptation.
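To make the distance computation concrete, the following is a minimal sketch of a length-normalized DTW distance between two sequences of frame embeddings. It is an illustration under stated assumptions, not the paper's definition of $d\text{-}\mathrm{DTW}_{\mathrm{norm}}$: the function name `dtw_distance_normalized`, the Euclidean frame cost, and the normalization by the sum of the two sequence lengths are all hypothetical choices made here.

```python
import numpy as np


def dtw_distance_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW cost between two (frames, dim) embedding sequences.

    Assumptions (not taken from the paper): Euclidean frame-level cost and
    normalization by len(a) + len(b).
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    # acc[i, j] holds the cheapest alignment cost of a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    return float(acc[n, m] / (n + m))


# Toy one-dimensional "embeddings": b repeats one frame of a, as a slower
# rendition of the same utterance might.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
d_ab = dtw_distance_normalized(a, b)
```

Because DTW may align one frame with several frames of the other sequence, two utterances that differ only in speaking rate can end up close together, which a fixed-length comparison such as a last-frame Euclidean distance cannot guarantee.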

Abstract

We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.
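The collaborative rescoring idea can be illustrated with a generic label propagation loop over an utterance graph. This is a hedged sketch rather than the authors' algorithm: the Gaussian kernel, the row-normalized transition matrix, the parameters `sigma`, `alpha`, and `n_iter`, and the function name `propagate_labels` are assumptions introduced for illustration. Each row of `y0` is a soft label distribution over a shared hypothesis set, as would be initialized from the baseline model's scores.

```python
import numpy as np


def propagate_labels(dist, y0, sigma=1.0, alpha=0.5, n_iter=50):
    """Propagate soft labels over a graph built from pairwise distances.

    dist: (n, n) symmetric utterance-utterance distances.
    y0:   (n, k) initial soft labels, each row summing to 1.
    Assumes at least two utterances; all hyperparameters are illustrative.
    """
    # Gaussian affinity: small distance -> large edge weight; no self-loops.
    w = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Row-normalize so each utterance averages over its neighbours.
    p = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)

    f = y0.copy()
    for _ in range(n_iter):
        # Blend the neighbours' current labels with the initial labels.
        f = alpha * (p @ f) + (1.0 - alpha) * y0
    return f


# Three utterances, two candidate transcripts. Utterances 0 and 1 are
# acoustically close; utterance 2 is far from both.
dist = np.array([[0.0, 0.1, 5.0],
                 [0.1, 0.0, 5.0],
                 [5.0, 5.0, 0.0]])
y0 = np.array([[0.9, 0.1],   # confident in transcript 0
               [0.4, 0.6],   # weakly prefers transcript 1
               [0.1, 0.9]])  # confident in transcript 1
f = propagate_labels(dist, y0)
top_before = y0.argmax(axis=1)
top_after = f.argmax(axis=1)
```

In this toy setup the weakly labelled utterance 1 adopts the top transcript of its close, confident neighbour. Note that row normalization still pulls an isolated utterance toward its nearest neighbours however distant they are, so a practical graph would likely use a k-nearest-neighbour or distance-threshold construction.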

Paper Structure

This paper contains 17 sections, 4 equations, 1 figure, 5 tables, and 1 algorithm.

Figures (1)

  • Figure 1: t-SNE visualization of utterance-utterance distances. Dots represent utterances in embedding space, with color and shape coding the transcript and accent of an utterance, respectively. (a) Euclidean distance based on last-frame embeddings. (b) d-DTW distance based on all-frames embeddings.