Contextualization of ASR with LLM using phonetic retrieval-based augmentation

Zhihong Lei; Xingyu Na; Mingbin Xu; Ernest Pusateri; Christophe Van Gysel; Yuanyuan Zhang; Shiyi Han; Zhen Huang

TL;DR

This work proposes a retrieval-based solution to contextualize the LLM: first, the LLM detects named entities in speech without any context; next, each detected named entity is used as a query to retrieve phonetically similar named entities from a personal database, which are fed to the LLM; finally, the LLM runs context-aware decoding.

Abstract

Large language models (LLMs) have shown a superb capability for modeling multimodal signals, including audio and text, allowing the model to generate a spoken or textual response given a speech input. However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM: we first let the LLM detect named entities in speech without any context, then use each detected named entity as a query to retrieve phonetically similar named entities from a personal database and feed them to the LLM, and finally run context-aware LLM decoding. In a voice assistant task, our solution achieved up to 30.2% relative word error rate reduction and 73.6% relative named entity error rate reduction compared to a baseline system without contextualization. Notably, our solution by design avoids prompting the LLM with the full named entity database, making it highly efficient and applicable to large named entity databases.
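The retrieval step in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how phonetic similarity is computed, so `phonetic_key` below is a deliberately crude, hypothetical stand-in for a real pronunciation model, and `edit_distance` and `retrieve` are likewise illustrative names. The point is only that a detected entity is compared against the personal database in a sound-based space, and that just the top few matches (not the whole database) would be passed on to the LLM.

```python
import re


def phonetic_key(name: str) -> str:
    """Hypothetical, crude phonetic key (an assumption, not the paper's method):
    lowercase, merge a few sound-alike spellings, drop non-leading vowels
    and h/w, then collapse repeated letters."""
    s = re.sub(r"[^a-z]", "", name.lower())
    for src, dst in (("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s"), ("y", "i")):
        s = s.replace(src, dst)
    if not s:
        return ""
    key = s[0] + re.sub(r"[aeiouhw]", "", s[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def retrieve(query: str, database: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k database entries whose phonetic keys are closest
    to the query's key. Ties keep database order (sorted is stable)."""
    q = phonetic_key(query)
    ranked = sorted(database, key=lambda n: edit_distance(q, phonetic_key(n)))
    return ranked[:top_k]


contacts = ["Bob Smith", "Kathryn Lee", "Carl Jones", "Catherine Li", "Dmitri"]
candidates = retrieve("Katherine Lee", contacts, top_k=2)
print(candidates)
```

Only `candidates` would then be placed in the prompt for the context-aware decoding pass, which is what keeps the prompt short regardless of database size.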

Paper Structure (13 sections, 3 figures, 4 tables)

This paper contains 13 sections, 3 figures, 4 tables.

Introduction
Methodology
LLM-based ASR
Our method
Training process
Experiments
Data and evaluation metrics
Models
Results
Analysis
Less accurate named entity detection results of ne-full
Impact of the number of retrieved named entities
Conclusion

Figures (3)

Figure 1: Baseline LLM-ASR system. Audio features from a pretrained audio encoder are subsampled, projected and fed to the LLM for decoding.
Figure 2: Our three-step method: named entity detection, phonetic-based retrieval and context-aware generation. In the example above, <c> and </c> indicate the start and the end of a contact named entity.
Figure 3: Example training source sequences for audios with and without detected named entities. Regions in grey serve as prompts to the LLM. Regions in blue and yellow serve as training targets for detection and generation.
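As an illustration of the contact tags described in the Figure 2 caption, the following sketch pulls tagged spans out of a detection-pass transcript so they can serve as retrieval queries. The exact output format of the model is not given here, so the sample string, the whitespace handling, and the function names `extract_contacts` and `strip_tags` are assumptions.

```python
import re

# Matches one contact span delimited by the start and end tags,
# tolerating optional whitespace just inside the tags.
CONTACT_SPAN = re.compile(r"<c>\s*(.*?)\s*</c>")


def extract_contacts(tagged: str) -> list[str]:
    """Return the text of every tagged contact span, in order."""
    return CONTACT_SPAN.findall(tagged)


def strip_tags(tagged: str) -> str:
    """Remove the tags and leave a plain transcript."""
    return re.sub(r"\s*</?c>\s*", " ", tagged).strip()


# Hypothetical first-pass output for a voice assistant request.
first_pass = "call <c> katherine lee </c> now"
print(extract_contacts(first_pass))
print(strip_tags(first_pass))
```

An utterance with no tagged span yields an empty list, in which case there is nothing to retrieve for that utterance.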
