High-precision Voice Search Query Correction via Retrievable Speech-text Embeddings

Christopher Li; Gary Wang; Kyle Kastner; Heng Su; Allen Chen; Andrew Rosenberg; Zhehuai Chen; Zelin Wu; Leonid Velikovich; Pat Rondon; Diamantino Caseiro; Petar Aleksic

TL;DR

This paper introduces a retrieval-based ASR correction system that uses multimodal speech-text embeddings to retrieve corrections directly from utterance audio, thereby eliminating hypothesis-audio mismatch. A MAESTRO-based shared encoder, coupled with a retrieval encoder trained in a dual-encoder setup, maps both audio and candidate corrections into a common embedding space for fast nearest-neighbor retrieval. Inference combines offline embedding of 128K candidates with a scoring rule that favors corrections closer to the spoken input, achieving a 6% relative WER reduction on in-database utterances without degrading precision on general utterances. The approach scales to large correction databases and offers modular integration on top of frozen base ASR models, with demonstrated improvements in recall and maintained precision across test sets to support real-world voice-search applications.

Abstract

Automatic speech recognition (ASR) systems can suffer from poor recall for various reasons, such as noisy audio or insufficient training data. Previous work has shown that recall can be improved by retrieving rewrite candidates from a large database of likely, contextually relevant alternatives to the hypothesis text, using nearest-neighbor search over embeddings of the ASR hypothesis text and of the candidate corrections. However, ASR-hypothesis-based retrieval can yield poor precision if the textual hypotheses are too phonetically dissimilar to the ground-truth transcript. In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio. The embeddings of the utterance audio and of the candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of an utterance's audio close to the embedding of its corresponding textual transcript. After locating an appropriate correction candidate using nearest-neighbor search, we score the candidate with its speech-text embedding distance before adding the candidate to the original n-best list. We show a relative word error rate (WER) reduction of 6% on utterances whose transcripts appear in the candidate set, without increasing WER on general utterances.
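
The retrieve-then-score flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance metric, the brute-force search, the `max_distance` threshold, the `distance_scale` factor, and the convention that an n-best entry is a `(text, cost)` pair with lower cost being better are all assumptions made for illustration, and the toy three-dimensional vectors stand in for real speech-text embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumes both vectors are non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def retrieve_nearest(
    audio_emb: Sequence[float], database: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Brute-force nearest-neighbor search over precomputed phrase embeddings."""
    scored = ((phrase, cosine_distance(audio_emb, emb)) for phrase, emb in database.items())
    return min(scored, key=lambda pair: pair[1])


def expand_nbest(
    nbest: List[Tuple[str, float]],
    candidate: str,
    distance: float,
    max_distance: float = 0.2,    # hypothetical acceptance threshold
    distance_scale: float = 1.0,  # hypothetical distance-to-cost factor
) -> List[Tuple[str, float]]:
    """Union the retrieved candidate with the n-best list, sorted by cost."""
    merged = dict(nbest)
    if distance <= max_distance:
        # Score the candidate by its embedding distance; keep the lower cost
        # if the same text is already present in the n-best list.
        cost = distance_scale * distance
        merged[candidate] = min(cost, merged.get(candidate, math.inf))
    return sorted(merged.items(), key=lambda item: item[1])


# Toy example: embeddings of candidate phrases would be computed offline.
database = {
    "play taylor swift": [0.9, 0.1, 0.0],
    "play tailor shift": [0.1, 0.9, 0.0],
}
audio_emb = [0.88, 0.12, 0.01]            # embedding of the utterance audio
nbest = [("play tailor swift", 0.5)]      # original ASR hypotheses with costs

phrase, distance = retrieve_nearest(audio_emb, database)
expanded = expand_nbest(nbest, phrase, distance)
```

In this sketch a retrieved phrase that lies far from the audio embedding is simply not added, which is one simple way to protect precision on utterances whose transcripts are not in the database. A database of the size mentioned in the TL;DR would likely use an approximate nearest-neighbor index rather than a linear scan.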

Paper Structure (14 sections, 3 equations, 3 figures, 1 table)

This paper contains 14 sections, 3 equations, 3 figures, 1 table.

Introduction
Related work
System Design
Feature extraction with a pretrained MAESTRO model
Speech-text embeddings retrieval with a trained encoder
Inference with the shared encoder and retrieval encoder
Experiments
Base ASR architecture
Description of models
Building the embedding database
Test sets
Results
Conclusion
Acknowledgements

Figures (3)

Figure 1: Overall system. An offline job builds the embedding database of retrievable phrases. The utterance audio is used to retrieve nearest neighbors. During n-best list expansion, the nearest-neighbor phrase is scored and unioned with the original n-best list.
Figure 2: MAESTRO model semi-supervised training process.
Figure 3: Retrieval Encoder supervised training process.
