Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

TL;DR

The paper addresses the contextualization of end-to-end ASR by letting users or developers bias recognition through an editable bias list. It introduces a deep biasing framework with a bias encoder and bias decoder, a bias phrase index loss, and special tokens to detect bias phrases, complemented by a bias phrase boosted (BPB) beam search that uses the bias phrase index probability during inference. Experiments on Librispeech-960 and a Japanese in-house dataset show consistent improvements in the bias-phrase metrics (B-WER and B-CER), with the BPB beam search providing additional gains and minimal impact on overall WER. The approach avoids retraining or external LMs, enabling practical, scalable contextualization for personalized or domain-specific terms across languages.

Abstract

End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, because their performance is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on the Librispeech-960 (English) and our in-house (Japanese) datasets, respectively.
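
The abstract says only that the BPB beam search is based on the bias phrase index probability; it does not state the scoring rule. As a rough illustration of the general idea, the sketch below assumes an additive bonus: at each decoding step, a candidate token that would continue a bias phrase has its log-probability raised in proportion to that phrase's probability. The function name `boost_token_scores`, the additive form, the prefix matching, and the toy numbers are all assumptions made for this example, not the paper's algorithm. The parameter `alpha_bonus` borrows its name from the decoding weight shown in Figure 3.

```python
import math


def boost_token_scores(log_probs, bias_phrases, bias_index_probs, prefix, alpha_bonus=1.0):
    """Hypothetical scoring step: add a bonus to tokens that continue a bias phrase.

    log_probs:        dict mapping candidate token -> decoder log-probability
    bias_phrases:     list of token lists (the editable bias list)
    bias_index_probs: one probability per bias phrase (assumed model output)
    prefix:           tokens already emitted by the current hypothesis
    """
    boosted = dict(log_probs)  # leave the caller's scores untouched
    for phrase, prob in zip(bias_phrases, bias_index_probs):
        # Find the longest partial match k such that the hypothesis ends with
        # the first k tokens of the phrase; k == 0 means the phrase may start here.
        for k in range(min(len(phrase) - 1, len(prefix)), -1, -1):
            if k == 0 or prefix[-k:] == phrase[:k]:
                next_token = phrase[k]
                if next_token in boosted:
                    boosted[next_token] += alpha_bonus * prob
                break
    return boosted


# Toy example: the hypothesis ends in "new" and the bias list contains "new york".
log_probs = {"new": math.log(0.5), "york": math.log(0.2), "the": math.log(0.3)}
bias_phrases = [["new", "york"], ["tokyo"]]
bias_index_probs = [0.9, 0.1]
prefix = ["in", "new"]

boosted = boost_token_scores(log_probs, bias_phrases, bias_index_probs, prefix, alpha_bonus=2.0)
best_token = max(boosted, key=boosted.get)
```

With these toy values, only "york" receives a bonus, which is enough to move it ahead of the other candidates. Setting `alpha_bonus` to 0 leaves the decoder scores unchanged, so in this sketch the weight controls how strongly the bias list steers the search.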

Paper Structure

This paper contains 16 sections, 15 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overall architecture of the proposed method, including the audio encoder, bias encoder, and bias decoder. The BPB beam search algorithm is used during inference.
  • Figure 2: Effect of the bias phrase index loss. The horizontal and vertical axes show token index $s$ and bias phrases in $\bm{B}$, respectively.
  • Figure 3: Effect of the decoding weight $\alpha_{\text{bonus}}$ of the BPB beam search on Librispeech-960.
  • Figure 4: Typical example. Bold text marks the bias phrases; red and blue text mark incorrect and correct recognition, respectively.
  • Figure 5: