Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, Takaaki Hori, Dogan Can, Xiaodan Zhuang

TL;DR

This work proposes a vector-quantization-based approximation to cross-attention scoring that enables compute- and memory-efficient use of large biasing catalogues. Because the approach is agnostic to the biasing method, the authors investigate full cross-attention, LLM prompting, and a combination of the two.

Abstract

Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval-based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use the retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval-based shortlisting allows the system to efficiently leverage biasing catalogues of several thousand entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
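
The two-stage idea in the abstract (quantized shortlisting, then biasing on the retrieved entries) can be illustrated with a minimal sketch. This is not the paper's algorithm: the nearest-codeword quantizer below is a generic stand-in for the FSQ quantizer mentioned in the figure captions, max-pooling of scores over audio frames is an assumption, and all function names (`quantize`, `approx_scores`, `shortlist`, `dense_attention`) and toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quantize(keys, codebook):
    """Map each biasing-entry key to the index of its nearest codeword."""
    def nearest(key):
        return min(
            range(len(codebook)),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(key, codebook[m])),
        )
    return [nearest(key) for key in keys]


def approx_scores(queries, codebook, codes):
    """Approximate query-key scores using only the codebook and the codes."""
    # One dot product per (frame, codeword) rather than per (frame, entry).
    table = [[dot(q, c) for c in codebook] for q in queries]
    # Each entry's score is a table lookup, pooled over frames with max
    # (the pooling choice is an assumption of this sketch).
    return [max(row[code] for row in table) for code in codes]


def shortlist(scores, k):
    """Indices of the k highest-scoring entries (ties keep catalogue order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def dense_attention(queries, keys, candidates):
    """Exact softmax attention weights per frame over the shortlist only."""
    weights = []
    for q in queries:
        logits = [dot(q, keys[i]) for i in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy data: 3 codewords, 4 catalogue entries, 2 audio frames (all made up).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
keys = [[0.9, 0.1], [0.1, 0.8], [-0.9, 0.0], [0.2, 1.1]]
queries = [[0.0, 1.0], [0.1, 0.9]]

codes = quantize(keys, codebook)                      # done once, offline
top = shortlist(approx_scores(queries, codebook, codes), k=2)
weights = dense_attention(queries, keys, top)         # exact, shortlist only
```

In this toy setup the second and fourth entries fall on the same codeword, so the approximate stage cannot tell them apart and both are shortlisted; the exact attention over the shortlist then separates them. The retrieval stage needs only one integer code per entry plus the codebook, rather than a full vector per entry, which is the kind of saving the abstract refers to.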

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Four considered biasing approaches: (a) vanilla contextual biasing referred to as Dense NCB; (b) Retrieval NCB, where a quantization module is used for large scale retrieval followed by TopK dense cross-attention processing; (c) LLM Prompting, where retrieved entries are packed into the prompt; (d) combination of Retrieval NCB and LLM Prompting. Note that in an efficient implementation, the context encoder needs to be run exactly once for each biasing entry and the encodings get cached and reused across multiple queries.
  • Figure 2: Retrieval success rates for Top1, Top5, and Top10, for various FSQ settings. The baseline numbers (w/o quantization) are given in the legend.
  • Figure 3: Average number of phrases retrieved per utterance for Top1, Top5, and Top10 retrieval, for various FSQ settings. The baseline (w/o quantization) numbers are given in the legend. The error bars represent one standard deviation from the averages.
  • Figure 4: Collision vs. Top1 retrieval success rate, for various FSQ settings.
  • Figure 5: Runtime analysis of baseline and proposed approaches across various biasing list sizes. The error bars represent one standard deviation from the averages, computed from all the queries in the same bin.
  • ...and 2 more figures