Contextualized Automatic Speech Recognition with Dynamic Vocabulary

Yui Sudo; Yosuke Fukumoto; Muhammad Shakeel; Yifan Peng; Shinji Watanabe

TL;DR

This paper proposes a dynamic vocabulary in which bias tokens can be added during inference. Each entry in a bias list is represented as a single token rather than as a sequence of existing subword tokens.

Abstract

Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, such as external language model shallow fusion or rescoring. However, these additional modules increase the workload. This paper proposes a dynamic vocabulary in which bias tokens can be added during inference. Each entry in a bias list is represented as a single token rather than as a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. The method is easily applied to various architectures because it only expands the embedding and output layers of common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 to 4.9 points compared with the conventional DB method.
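The contrast between a static subword decomposition and a single dynamic bias token can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the tiny vocabulary, the greedy longest-match tokenizer (a stand-in for a real subword model), and the helper names `build_dynamic_vocab` and `tokenize` do not come from the paper.

```python
# Toy static subword vocabulary (hypothetical; real systems use e.g. BPE).
STATIC_VOCAB = ["w", "a", "t", "n", "b", "e", "wa", "ta", "na", "be"]


def greedy_subwords(word, static_vocab):
    """Greedy longest-match split of a word into static subword pieces."""
    pieces, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in static_vocab:
                pieces.append(word[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {word!r} at position {pos}")
    return pieces


def build_dynamic_vocab(static_vocab, bias_list):
    """Append one new token id per bias phrase after the static ids."""
    token_to_id = {tok: i for i, tok in enumerate(static_vocab)}
    for n, phrase in enumerate(bias_list):
        token_to_id[f"<{phrase}>"] = len(static_vocab) + n
    return token_to_id


def tokenize(word, static_vocab, bias_list):
    """Emit a single bias token if the word is in the bias list."""
    if word in bias_list:
        return [f"<{word}>"]
    return greedy_subwords(word, static_vocab)


bias_list = ["watanabe"]  # supplied at inference time
vocab = build_dynamic_vocab(STATIC_VOCAB, bias_list)

static_tokens = greedy_subwords("watanabe", STATIC_VOCAB)
dynamic_tokens = tokenize("watanabe", STATIC_VOCAB, bias_list)
print(static_tokens)   # four subword pieces the model must chain together
print(dynamic_tokens)  # one token, no intra-phrase dependencies to learn
```

In this toy setting the bias phrase occupies one output step instead of four, which is the property the abstract refers to when it says subword dependencies within the bias phrase no longer need to be learned.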

Paper Structure (18 sections, 23 equations, 5 figures, 3 tables)

Introduction
End-to-end ASR
Audio encoder
Decoder
Proposed method
Bias encoder
Expanded decoder with dynamic vocabulary
Application to hybrid E2E-ASR systems
Training
Bias weight during inference
Experiment
Experimental setup
Results of the offline CTC/attention-based system
Analysis of the proposed bias token
Effect of bias weight during inference
...and 3 more sections

Figures (5)

Figure 1: (a) Overall architecture of the proposed method, including the audio encoder, bias encoder, and decoder, with the expanded embedding and output layers. (b) Expanded embedding layer: If the input token is a dynamic bias token, the corresponding embedding $\bm{v}_n$ is extracted. (c) Expanded output layer: The bias score $\bm{\alpha}^{\text{b}}$ is calculated using the inner product.
Figure 2: Various architectures utilized in the proposed method.
Figure 3: Example of cumulative log probability during beam search.
Figure 4: Effect of the bias weight $\mu$.
Figure 5: Typical inference example. Boldface marks the bias phrases; red and blue mark incorrectly and correctly recognized characters, respectively.
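The expanded embedding and output layers described in the Figure 1 caption can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the inner product is taken between the decoder hidden state and each bias embedding $\bm{v}_n$, that static and bias scores are concatenated before a single softmax, and that no extra projection, scaling, or bias weight is applied. The function names and the numeric values are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expanded_embedding(token_id, static_emb, bias_emb):
    """Ids below K use the static table; ids K..K+N-1 use bias embeddings."""
    K = len(static_emb)
    if token_id < K:
        return static_emb[token_id]
    return bias_emb[token_id - K]


def expanded_output(hidden, static_weight, bias_emb):
    """Concatenate static scores with inner-product bias scores (assumed)."""
    static_scores = [dot(w, hidden) for w in static_weight]
    bias_scores = [dot(v, hidden) for v in bias_emb]  # one score per bias phrase
    return softmax(static_scores + bias_scores)


# Hypothetical values: K = 3 static tokens, N = 2 bias phrases, dimension 2.
static_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
static_weight = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]
bias_emb = [[2.0, 2.0], [-1.0, 0.0]]  # would come from the bias encoder
hidden = [1.0, 1.0]                   # decoder hidden state at one step

probs = expanded_output(hidden, static_weight, bias_emb)
best = probs.index(max(probs))
print(len(probs), best)
```

Because the bias embeddings are the only per-phrase parameters, the bias list can change between utterances without retraining the static embedding or output weights, which is consistent with the claim that only these two layers are expanded.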
