Transformer-based Model for ASR N-Best Rescoring and Rewriting
Iwen E. Kang, Christophe Van Gysel, Man-Hung Siu
TL;DR
The paper addresses a limitation of on-device ASR: under device resource constraints, queries in complex knowledge domains are often misrecognized. It uses the full context of the N-best hypotheses to rescore and rewrite queries jointly, aiming to improve accuracy for such requests while preserving user privacy. It introduces the Transformer Rescore Attention (TRA) model and a discriminative Matching Query Similarity Distribution (MQSD) loss, which together enable joint rescoring and rewriting without exposing acoustic inputs. Empirical results show that TRA variants outperform both a rescore-only baseline and a traditional 4-gram LM, with up to 8.6% relative WER reduction on in-domain music queries and further gains when interpolated with ASR signals. The work offers a practical, privacy-conscious way to refine on-device ASR output for real-world assistants.
Abstract
Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer-based model capable of rescoring and rewriting by exploring the full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that works well for both rescore and rewrite tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system by itself.
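
To make the "relative WER reduction" figure above concrete, the following minimal sketch computes WER using the standard definition (word-level edit distance divided by the number of reference words) and the relative reduction between two systems. It is an illustration only, not the authors' evaluation code: the function names and the toy transcripts are assumptions introduced here.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level WER: total word errors / total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split())
                 for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy data, invented for illustration (not taken from the paper).
refs = ["play hey jude by the beatles"]
asr_top1 = ["play hey dude by the beetles"]    # hypothetical ASR 1-best
rewritten = ["play hey jude by the beetles"]   # hypothetical rewritten output

baseline = wer(refs, asr_top1)
improved = wer(refs, rewritten)
print(f"baseline WER {baseline:.3f}, improved WER {improved:.3f}, "
      f"relative reduction {relative_wer_reduction(baseline, improved):.1%}")
```

Under this definition, an 8.6% relative reduction would take a hypothetical baseline WER of 10.0% to roughly 9.14%; it is not an 8.6-point absolute drop.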
