A Neural Model for Contextual Biasing Score Learning and Filtering

Wanting Huang, Weiran Wang

TL;DR

The paper tackles contextual biasing in ASR by introducing an attention-based biasing decoder that scores candidate phrases from acoustic features. The decoder is trained with a two-part loss that combines a phrase-level likelihood with a discriminative objective promoting true phrases over distractors, with the trade-off controlled by a beta hyperparameter. At inference, the model filters large biasing lists using per-token scores and tol thresholds, computes a per-token bonus as the maximum over the remaining phrases, and applies that bonus during shallow fusion. Experiments on the Librispeech biasing benchmark show substantial reductions in B-WER and competitive WER gains while dramatically reducing the number of phrases considered, highlighting the approach's modularity, efficiency, and potential applicability to other biasing strategies.

Abstract

Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder; these scores can be used to filter out unlikely phrases and to calculate the bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out the majority of the candidate phrases and, when the scores are used in shallow-fusion biasing, significantly improves recognition accuracy under different biasing conditions. Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost the performance of other biasing methods.
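
The filter-then-bonus pipeline summarized above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mean-score filtering rule, the assumption that scores lie in [0, 1], and the context-free per-vocabulary-token bonus are all hypothetical simplifications chosen only to make the idea concrete.

```python
import numpy as np

def filter_phrases(token_scores, threshold):
    """Keep phrases whose mean per-token score reaches the threshold.

    Assumption: the filtering rule (mean >= threshold) is illustrative only.
    """
    return [i for i, s in enumerate(token_scores) if float(np.mean(s)) >= threshold]

def per_token_bonus(phrases, token_scores, kept, vocab_size):
    """Bonus for each vocabulary token: max score over the surviving phrases.

    Assumption: scores are in [0, 1], so 0 means "no bonus". Prefix context
    is ignored here, which is a simplification.
    """
    bonus = np.zeros(vocab_size)
    for i in kept:
        for tok, score in zip(phrases[i], token_scores[i]):
            bonus[tok] = max(bonus[tok], score)
    return bonus

def shallow_fusion(asr_log_probs, bonus, weight):
    """Add the weighted bonus to the ASR log-probabilities for one decoding step."""
    return asr_log_probs + weight * bonus

# Toy biasing list: three phrases given as token-id sequences (hypothetical data).
phrases = [[1, 2], [2, 3], [4]]
token_scores = [np.array([0.9, 0.8]), np.array([0.6, 0.7]), np.array([0.1])]

kept = filter_phrases(token_scores, threshold=0.5)   # phrase 2 is filtered out
bonus = per_token_bonus(phrases, token_scores, kept, vocab_size=6)
fused = shallow_fusion(np.log(np.full(6, 1.0 / 6.0)), bonus, weight=1.0)
print(kept, bonus, fused)
```

In this toy setting the low-scoring third phrase never contributes a bonus, which is the efficiency argument in miniature: fewer active phrases means fewer tokens that can be boosted by mistake.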

Paper Structure

This paper contains 14 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overall architecture of our method.
  • Figure 2: B-WERs of our method on dev sets, across different values of $N$.
  • Figure 3: Number of active phrases of our method on dev sets, across different values of $N$.