Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Yui Sudo; Muhammad Shakeel; Yosuke Fukumoto; Yifan Peng; Shinji Watanabe

TL;DR

The paper addresses the contextualization of end-to-end ASR by letting users or developers bias recognition through an editable bias list. It introduces a deep biasing framework with a bias encoder and a bias decoder, trained with a bias phrase index loss and special tokens that help detect bias phrases, and complements it with a bias phrase boosted (BPB) beam search that uses the bias phrase index probability during inference. Experiments on Librispeech-960 and a Japanese in-house dataset show consistent improvements in the bias-phrase error rates (B-WER and B-CER), with the BPB beam search providing additional gains and minimal impact on overall WER. The approach avoids retraining and external LMs, which makes contextualization for personalized or domain-specific terms practical across languages.

Abstract

End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on the Librispeech-960 (English) and our in-house (Japanese) datasets, respectively.
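
To make the idea of boosting candidates with a bias phrase index probability more concrete, here is a minimal Python sketch of one decoding step. It is an illustration under stated assumptions, not the paper's formulation: the function names `boost_token_scores` and `top_k`, the additive bonus rule, the convention that index 0 of `bias_index_probs` means "no bias phrase", and the toy scores are all hypothetical. Only the bias list, the bias phrase index probability, and a decoding weight for the bonus (called `alpha_bonus` here) come from the text.

```python
def boost_token_scores(token_scores, bias_phrases, bias_index_probs, alpha_bonus=1.0):
    """Return a copy of token_scores with a bonus for tokens of likely bias phrases.

    token_scores:     dict mapping candidate token -> score for the next step
    bias_phrases:     list of token lists (the editable bias list)
    bias_index_probs: list of probabilities; entry 0 is assumed to mean
                      "no bias phrase", entry i+1 belongs to bias_phrases[i]
    alpha_bonus:      weight of the bonus (assumed additive rule)
    """
    if len(bias_index_probs) != len(bias_phrases) + 1:
        raise ValueError("expected one probability per phrase plus a no-bias entry")

    boosted = dict(token_scores)  # do not mutate the caller's scores
    for phrase, prob in zip(bias_phrases, bias_index_probs[1:]):
        for token in set(phrase):
            if token in boosted:
                # Tokens of a phrase the model believes is being spoken get a bonus.
                boosted[token] += alpha_bonus * prob
    return boosted


def top_k(token_scores, k):
    """Return the k highest-scoring tokens, as a beam search step would keep them."""
    return sorted(token_scores, key=token_scores.get, reverse=True)[:k]


if __name__ == "__main__":
    scores = {"the": 0.5, "su": 0.3, "do": 0.1, "a": 0.1}
    bias_list = [["su", "do"], ["wata", "nabe"]]
    index_probs = [0.2, 0.7, 0.1]  # [no-bias, phrase 0, phrase 1]

    print(top_k(scores, 2))
    print(top_k(boost_token_scores(scores, bias_list, index_probs), 2))
```

With `alpha_bonus=0.0` the scores are returned unchanged, so the sketch reduces to an ordinary top-k pruning step. Because the bias list is passed in at call time, it can be edited without touching the scoring model.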

Paper Structure (16 sections, 15 equations, 5 figures, 3 tables)

This paper contains 16 sections, 15 equations, 5 figures, 3 tables.

Introduction
Attention-based encoder-decoder ASR
Audio encoder
Attention-based decoder
Proposed deep biasing method
Bias encoder
Bias decoder
Training
BPB beam search algorithm
Experiment
Experimental setup
Preliminary analysis of the proposed techniques
Main results
Analysis of the BPB beam search algorithm
Validation on Japanese dataset
...and 1 more section

Figures (5)

Figure 1: Overall architecture of the proposed method, including the audio encoder, bias encoder, and bias decoder. The BPB beam search algorithm is used during inference.
Figure 2: Effect of the bias phrase index loss. The horizontal and vertical axes show token index $s$ and bias phrases in $\bm{B}$, respectively.
Figure 3: Effect of the decoding weight $\alpha_{\text{bonus}}$ of the BPB beam search on Librispeech-960.
Figure 4: Typical example. Bold text marks the bias phrases; red and blue text mark incorrectly and correctly recognized phrases, respectively.
