Cross-utterance ASR Rescoring with Graph-based Label Propagation

Srinath Tankasala; Long Chen; Andreas Stolcke; Anirudh Raju; Qianli Deng; Chander Chandak; Aparna Khare; Roland Maas; Venkatesh Ravichandran

TL;DR

This paper tackles majoritarian bias in ASR by moving beyond single-utterance rescoring and leveraging cross-utterance acoustic similarity. It introduces a graph-based label propagation framework that operates on a finite label set derived from the union of $N$-best hypotheses and uses a DTW-based distance $d\text{-}\mathrm{DTW}_{\mathrm{norm}}$ over RNN-T encoder embeddings to form utterance graphs. Soft, probabilistic labels initialized from the baseline model propagate through the graph to jointly rescore across utterances, enabling label sharing across similar utterances and potentially recovering hypotheses outside the initial $N$-best list. Experiments on LibriSpeech and VCTK demonstrate improved WER and SER across accents, highlighting the method’s ability to mitigate biases and enhance offline ASR performance without domain-specific model adaptation.

Abstract

We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.
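To make the idea of collaborative rescoring concrete, the following is a minimal sketch of a generic label-spreading update on an utterance graph. It is not the authors' implementation: the Gaussian kernel, the symmetric normalization, and the names `propagate_labels`, `sigma`, `alpha`, and `n_iter` are all assumptions chosen for illustration. Each row of `y0` stands for one utterance's baseline posterior over a shared label set (for example, the union of $N$-best hypotheses), and `dist` stands for a precomputed utterance-utterance distance matrix.

```python
import numpy as np

def propagate_labels(dist, y0, sigma=1.0, alpha=0.5, n_iter=50):
    """Generic label spreading (illustrative, not the paper's exact update).

    dist: (n, n) symmetric utterance-utterance distances.
    y0:   (n, k) initial soft labels, one row per utterance.
    """
    # Assumed Gaussian kernel turning distances into edge weights.
    w = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # no self-loops
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    deg = np.maximum(w.sum(axis=1), 1e-12)
    s = w / np.sqrt(np.outer(deg, deg))
    y = y0.copy()
    for _ in range(n_iter):
        # Mix neighbours' current labels with the baseline initialization.
        y = alpha * (s @ y) + (1.0 - alpha) * y0
    # Renormalize rows so each utterance again has a distribution over labels.
    return y / y.sum(axis=1, keepdims=True)

# Toy data (hypothetical): utterances 0 and 1 are acoustically close,
# utterance 2 is far from both. The baseline prefers label 1 for utterance 1.
dist = np.array([[0.0, 0.1, 5.0],
                 [0.1, 0.0, 5.0],
                 [5.0, 5.0, 0.0]])
y0 = np.array([[0.9, 0.1],
               [0.4, 0.6],
               [0.1, 0.9]])
rescored = propagate_labels(dist, y0)
```

In this toy setup, the confident label of utterance 0 flows to its close neighbour, utterance 1, while the distant utterance 2 is left essentially unchanged. This is the sense in which rescoring is conducted jointly among utterances rather than individually.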

Paper Structure (17 sections, 4 equations, 1 figure, 5 tables, 1 algorithm)


Introduction
Proposed Method
Problem setup
Utterance-utterance distance modeling
Graph construction
Label propagation
Graph-LP for cross-utterance ASR rescoring
Experiments
Datasets
Baseline and embedding generation model
Metric selection for utterance-utterance distance
Graph-LP experiments
Results
Baseline model results
EER and metric selection results
...and 2 more sections

Figures (1)

Figure 1: t-SNE visualization of utterance-utterance distances. Dots represent utterances in embedding space, with color and shape coding the transcript and accent of an utterance, respectively. (a) Euclidean distance based on last-frame embeddings. (b) d-DTW distance based on all-frames embeddings.
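The contrast in the caption between a last-frame Euclidean distance and a DTW distance over all frames can be illustrated with a small sketch. The paper's exact $d$-DTW definition and its normalization are not given in this excerpt, so the code below uses plain dynamic time warping with a Euclidean frame cost, divided by the sum of the two sequence lengths. Both that normalization and the name `dtw_distance` are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW cost between two embedding sequences.

    a: (Ta, D) and b: (Tb, D) frame-level embeddings.
    The (Ta + Tb) normalization is an illustrative assumption.
    """
    ta, tb = len(a), len(b)
    # Euclidean cost between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # acc[i, j] = minimal accumulated cost aligning a[:i] with b[:j].
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    return acc[ta, tb] / (ta + tb)

# Hypothetical toy sequences: b is a time-stretched copy of a,
# c visits the same frames in reverse order.
a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
b = np.repeat(a, 2, axis=0)
c = a[::-1]
d_ab = dtw_distance(a, b)
d_ac = dtw_distance(a, c)
```

Because DTW aligns frames before comparing them, the time-stretched copy `b` is at zero distance from `a`, whereas the reversed sequence `c` is not. A distance that looks only at the last frame cannot take the whole frame trajectory into account in this way.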
