Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems

Bo Ren, Yu Shi, Jinyu Li

TL;DR

End-to-end ASR systems still struggle with rare and domain-specific terms. The authors propose a lightweight Prompt Biasing approach that integrates contextual information via a unified multi-task Transformer framework and a decoding-time entity-filtering step, without architectural changes. The method substantially reduces Entity Word Error Rate (EWER) on in-house domain data (relative reductions of 30.7% for small entity lists and 18.0% for large entity lists) and demonstrates robustness to noise while preserving competitive word error rate (WER). This approach offers an efficient, scalable way to inject domain-specific context into Transformer-based ASR for real-world deployment.

Abstract

End-to-End Automatic Speech Recognition (ASR) has advanced significantly yet still struggles with rare and domain-specific entities. This paper introduces a simple yet efficient prompt-based biasing technique for contextualized ASR that improves recognition accuracy by leveraging a unified multitask learning framework. The approach comprises two key components: a prompt biasing model, which is trained to determine when to focus on entities in the prompt, and an entity filtering mechanism, which efficiently filters out irrelevant entities. Our method significantly improves ASR accuracy on entities, achieving relative reductions of 30.7% and 18.0% in Entity Word Error Rate compared to the baseline model with shallow fusion on an in-house domain dataset with small and large entity lists, respectively. The primary advantage of this method is that it requires no structural change to the model, making it simple, lightweight, and efficient.

Paper Structure

This paper contains 16 sections, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the proposed Prompt Biasing method. (a) Standard Transformer ASR training, where the model predicts the next token from previous tokens. (b) Unified multi-task training for contextual biasing, where the model predicts the next token using both previous tokens and contextual information provided as a prompt. (c) Multi-task token format for Prompt Biasing training: biasing and non-biasing tasks are indicated by special task tokens (<hit>/<miss>), and the biasing list is included as a prompt starting with the <sop> token.
  • Figure 2: Decoding process for the Prompt Biasing model. The model is trained to predict <hit> or <miss> for each sub-word in the prompt, enabling effective filtering of irrelevant entities. During decoding, entity filtering (see Algorithm 1) greatly reduces the biasing list size.
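
The two captions above describe a prompt that starts with <sop> and a per-sub-word <hit>/<miss> prediction used to shrink the biasing list. The following is only a minimal sketch of that idea, not the paper's Algorithm 1. The function names `build_prompt` and `filter_entities`, the example sub-words, the lack of a separator between entities, and the rule that an entity is kept only when all of its sub-words are tagged <hit> are assumptions made for illustration; in a real system the tags would come from the model rather than being passed in by hand.

```python
from typing import List

# Special tokens named in the figure captions.
SOP, HIT, MISS = "<sop>", "<hit>", "<miss>"


def build_prompt(entities: List[List[str]]) -> List[str]:
    """Flatten a biasing list into a prompt that starts with <sop>.

    Each entity is a list of sub-word tokens. Concatenating entities
    without a separator is an assumption of this sketch.
    """
    prompt = [SOP]
    for subwords in entities:
        prompt.extend(subwords)
    return prompt


def filter_entities(
    entities: List[List[str]], tags: List[List[str]]
) -> List[List[str]]:
    """Keep only entities whose sub-words were all tagged <hit>.

    `tags` holds one <hit>/<miss> tag per sub-word, aligned with
    `entities`. The all-sub-words rule is a hypothetical choice.
    """
    if len(entities) != len(tags):
        raise ValueError("entities and tags must have the same length")
    kept = []
    for subwords, entity_tags in zip(entities, tags):
        if len(subwords) != len(entity_tags):
            raise ValueError("each entity needs one tag per sub-word")
        if subwords and all(tag == HIT for tag in entity_tags):
            kept.append(subwords)
    return kept


# Hypothetical biasing list and hypothetical model predictions.
entities = [["▁ac", "me"], ["▁zur", "ich"], ["▁delta"]]
tags = [[HIT, HIT], [HIT, MISS], [MISS]]

filtered = filter_entities(entities, tags)
reduced_prompt = build_prompt(filtered)
```

With these hand-written tags, only the first entity survives, so the prompt passed on to decoding is much shorter than the one built from the full list. A looser rule (for example, keeping an entity when any sub-word is tagged <hit>) would trade a longer prompt for fewer dropped entities.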