Transformer-based Model for ASR N-Best Rescoring and Rewriting

Iwen E. Kang, Christophe Van Gysel, Man-Hung Siu

TL;DR

The paper tackles on-device ASR limitations by leveraging the full N-best hypothesis context to simultaneously rescore and rewrite queries, aiming to improve accuracy for knowledge-domain requests while preserving user privacy. It introduces the Transformer Rescore Attention (TRA) model and a discriminative Matching Query Similarity Distribution (MQSD) loss, enabling joint rescoring and rewriting without exposing acoustic inputs. Empirical results show TRA variants outperform a rescore-only baseline and even a traditional 4-gram LM, with up to 8.6% relative WER reduction on in-domain music queries and additional gains when interpolating with ASR signals. The work demonstrates practical on-device speech understanding enhancements and provides a pathway for privacy-conscious, context-aware ASR refinement in real-world assistants.

Abstract

Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer-based model capable of rescoring and rewriting, by exploring the full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that can work well for both rescore and rewrite tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline, and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system by itself.
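
To make the headline metric concrete, the following is a minimal sketch of how WER and a relative WER reduction are conventionally computed. The function names and the example WER values are hypothetical and chosen only for illustration; they are not taken from the paper's results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values: a baseline WER of 10.00% falling to 9.14%
# corresponds to a relative reduction of about 8.6%.
example_reduction = relative_wer_reduction(0.10, 0.0914)
```

A relative reduction is measured against the baseline's own error rate, so the same relative figure implies a smaller absolute change when the baseline WER is already low.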

Paper Structure

This paper contains 19 sections, 4 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Transformer Rescore Attention (TRA) model. During training, the Target, the N-best hypotheses, and the query similarity scores are fed to TRA. The Target sequence is shifted right and used as input to the decoder stack to compute the per-token cross-entropy loss, and the query similarity scores are compared against the N-best predicted scores to compute the MQSD loss. At inference time, the predicted "Target" is used to compute the N-best predicted scores for rescoring. The predicted output text can also be used to override the 1-best if its sequence loss exceeds a threshold.
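
The inference-time behaviour described in the Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Hypothesis` class, the function names, the linear interpolation weight `alpha`, and the convention that higher scores are better are all hypothetical. The override rule follows the caption's wording literally (override when the sequence loss exceeds the threshold); the sign convention of that loss is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    asr_score: float    # assumed: score from the recognizer, higher is better
    model_score: float  # assumed: score predicted by the rescoring model


def rescore(nbest: List[Hypothesis], alpha: float = 1.0) -> List[Hypothesis]:
    """Sort the N-best list by an interpolated score.

    alpha = 1.0 uses only the model score and alpha = 0.0 only the ASR score.
    Linear interpolation is an assumption made for this sketch.
    """
    def combined(h: Hypothesis) -> float:
        return alpha * h.model_score + (1.0 - alpha) * h.asr_score

    return sorted(nbest, key=combined, reverse=True)


def choose_output(
    nbest: List[Hypothesis],
    rewrite_text: str,
    rewrite_sequence_loss: float,
    threshold: float,
    alpha: float = 1.0,
) -> str:
    """Return the rewrite if it passes the threshold test, else the rescored 1-best."""
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    if rewrite_sequence_loss > threshold:  # literal reading of the caption
        return rewrite_text
    return rescore(nbest, alpha)[0].text


# Hypothetical two-item N-best list and rewrite candidate.
nbest = [
    Hypothesis("play bach", asr_score=0.6, model_score=0.2),
    Hypothesis("play back", asr_score=0.4, model_score=0.9),
]
kept = choose_output(nbest, "play Bach", rewrite_sequence_loss=0.5, threshold=1.0)
overridden = choose_output(nbest, "play Bach", rewrite_sequence_loss=2.0, threshold=1.0)
```

Keeping rescoring and the override decision in separate functions means the rewrite path can be disabled (for example, by setting a very high threshold) without changing how the N-best list is ranked.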