Transformer-based Model for ASR N-Best Rescoring and Rewriting

Iwen E. Kang; Christophe Van Gysel; Man-Hung Siu

TL;DR

The paper tackles on-device ASR limitations by leveraging the full N-best hypothesis context to simultaneously rescore and rewrite queries, aiming to improve accuracy for knowledge-domain requests while preserving user privacy. It introduces the Transformer Rescore Attention (TRA) model and a discriminative Matching Query Similarity Distribution (MQSD) loss, enabling joint rescoring and rewriting without exposing acoustic inputs. Empirical results show TRA variants outperform a rescore-only baseline and even a traditional 4-gram LM, with up to 8.6% relative WER reduction on in-domain music queries and additional gains when interpolating with ASR signals. The work demonstrates practical on-device speech understanding enhancements and provides a pathway for privacy-conscious, context-aware ASR refinement in real-world assistants.

Abstract

Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer-based model capable of both rescoring and rewriting by exploring the full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that works well for both the rescore and rewrite tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system by itself.
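
The headline figure above is a relative WER reduction, which is easy to confuse with an absolute one. As a minimal sketch (not the authors' evaluation code), the following computes WER as word-level edit distance divided by reference length, then the relative reduction between a baseline and a rescored/rewritten system. The function names, the example query, and the WER values 0.10 and 0.0914 are hypothetical illustrations, not figures from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: the ASR 1-best has one wrong word out of five,
# and the rewritten query matches the reference exactly.
asr_wer = word_error_rate("play songs by the beatles", "play songs by the beetles")
rewritten_wer = word_error_rate("play songs by the beatles", "play songs by the beatles")
```

Under these assumptions, a baseline WER of 0.10 falling to 0.0914 corresponds to an 8.6% relative reduction, which is an absolute reduction of only 0.86 percentage points.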

Paper Structure (19 sections, 4 equations, 1 figure, 3 tables)

Introduction
Models
Transformer Rescorer (TR) Model
Transformer Rescore Attention (TRA) Model
Rescore Attention Layer
Matching Query Similarity Distribution (MQSD) Loss
Experimental Setup
ASR system
Training and evaluation data
Training data
Evaluation sets
Rescoring/rewriting methods under comparison
Transformers
4-gram LM with Katz back-off
Interpolation with ASR decoding signals
...and 4 more sections

Figures (1)

Figure 1: Transformer Rescore Attention (TRA) model. During training, the Target, N-best and query similarity scores are fed to TRA; the Target sequence is shifted right as input for the decoder stack to compute per-token cross-entropy loss; and the query similarity scores are used as input against the N-best predicted scores to compute MQSD loss. At inference time, the predicted "Target" is used to compute the N-best predicted scores for rescoring; the predicted output text can also be used to override the 1-best if its sequence loss exceeds a threshold.
