LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition

Fan Yu, Haoxu Wang, Xian Shi, Shiliang Zhang

TL;DR

The paper addresses automatic transcription of audio-visual streams in which synchronized slide text offers long-context biasing for rare phrases. It introduces LCB-net, a bi-encoder AVSR architecture with a dedicated biasing-prediction module and a contextual phrases simulation strategy for exploiting long-context information from slides. Compared with a general ASR baseline, LCB-net achieves relative WER/U-WER/B-WER reductions of 9.4%/9.1%/10.9% on the SlideSpeech test set and 23.8%/19.2%/35.4% on LibriSpeech, with additional gains attributed to the BCE-based biasing predictor and BPE-level simulation. These results suggest that long-context biasing from slides improves recognition of biased phrases without degrading unbiased performance, which is promising for slide-rich settings such as online conferences and courses.
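For readers unfamiliar with the metric, a relative reduction compares the error rate of a system against a baseline as a fraction of the baseline error. The sketch below shows the arithmetic only; the function name and the example values are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative error-rate reduction in percent: 100 * (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical error rates, for illustration only (not results from the paper).
baseline = 20.0  # baseline WER in percent
system = 18.0    # system WER in percent
print(f"{relative_reduction(baseline, system):.1f}% relative reduction")
```

The same formula applies to U-WER and B-WER, which in the contextual-biasing literature are commonly computed over words outside and inside the biasing list, respectively.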

Abstract

The growing prevalence of online conferences and courses presents a new challenge: improving automatic speech recognition (ASR) with enriched textual information from video slides. In contrast to rare phrase lists, the slides within videos are synchronized in real time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to effectively leverage the long-context information available in videos. Specifically, we adopt a bi-encoder architecture to simultaneously model audio and long-context biasing. We also propose a biasing prediction module that uses binary cross entropy (BCE) loss to explicitly determine biased phrases in the long-context biasing. Furthermore, we introduce a dynamic contextual phrases simulation to enhance the generalization and robustness of LCB-net. Experiments on SlideSpeech, a large-scale audio-visual corpus enriched with slides, show that LCB-net outperforms a general ASR model by 9.4%/9.1%/10.9% relative WER/U-WER/B-WER reduction on the test set, improving both unbiased and biased performance. Moreover, on the LibriSpeech corpus, LCB-net achieves 23.8%/19.2%/35.4% relative WER/U-WER/B-WER reduction over the ASR model.
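To make the idea of a BCE-trained biasing predictor more concrete, the following is a minimal sketch under strong assumptions: each phrase in the long context is labeled 1 if it occurs in the reference transcript and 0 otherwise, and a binary cross entropy loss is computed from per-phrase logits. The abstract does not specify the label granularity, the inputs, or the network used by the paper's module, so the helper names `biasing_labels` and `bce_with_logits` and the phrase-level labeling are illustrative assumptions, not the authors' implementation.

```python
import math


def biasing_labels(context_phrases, transcript):
    """Assumed labeling rule: 1.0 if the phrase occurs in the transcript
    as a contiguous word sequence (case-insensitive), else 0.0."""
    words = transcript.lower().split()
    labels = []
    for phrase in context_phrases:
        p = phrase.lower().split()
        hit = bool(p) and any(
            words[i:i + len(p)] == p for i in range(len(words) - len(p) + 1)
        )
        labels.append(1.0 if hit else 0.0)
    return labels


def bce_with_logits(logits, labels):
    """Mean binary cross entropy over phrases, computed from raw logits
    with the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|))."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


# Hypothetical slide phrases, transcript, and predictor logits.
phrases = ["cross entropy", "beam search", "encoder"]
transcript = "we train the encoder with cross entropy loss"
labels = biasing_labels(phrases, transcript)
logits = [2.0, -1.5, 0.5]  # made-up scores, one per phrase
print(labels, bce_with_logits(logits, labels))
```

In a real system the logits would come from a trained module over encoder representations rather than being supplied by hand.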

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An overview of the AV-ASR system.
  • Figure 2: (a) LCB-net (b) Biasing Prediction
  • Figure 3: An overview of contextual phrases simulation.
  • Figure 4: Attention score matrix of AC cross-attention. Brighter colors denote values closer to 1, while darker colors indicate values closer to 0. Red and blue mark the two biased phrases. Only the region around the bright biasing entries is plotted because the number of biasing phrases is large.