Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke
TL;DR
This work addresses SLU robustness to ASR errors by representing ASR output as word confusion networks (WCNs) derived from lattices and passing these representations to off-the-shelf LLMs via prompting, without fine-tuning. The study shows that WCNs can close part of the gap between top ASR hypothesis performance and oracle performance on both spoken question answering and intent classification, with notable gains when using GPT-3.5-turbo with appropriate prompting (including a WCN instruction and posterior filtering). Model size and in-context learning design strongly influence the gains: WCN-based prompts provide resilience across a range of ASR qualities, though the benefits are limited for smaller models. Overall, the approach offers a simple, fine-tuning-free way to improve SLU performance under real-world, noisy ASR conditions, with practical implications for deployed voice-enabled systems.
Abstract
In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automatic speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts with the help of word confusion networks from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we delve into the LLM's robustness to varying ASR performance conditions and scrutinize the aspects of in-context learning that prove the most influential.
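
To make the idea of prompting an LLM with a word confusion network more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a WCN is already available as a list of bins, each holding competing words with posterior probabilities. The serialization format (alternatives joined by "/"), the 0.3 posterior threshold, the instruction wording, and the names `filter_wcn`, `serialize_wcn`, and `build_prompt` are all illustrative assumptions; the text above does not specify them.

```python
from typing import List, Tuple

# Assumed representation: a WCN is a sequence of bins, and each bin is a
# list of (word, posterior) alternatives competing for the same time slot.
WCN = List[List[Tuple[str, float]]]


def filter_wcn(wcn: WCN, min_posterior: float = 0.3) -> WCN:
    """Posterior filtering (threshold is a hypothetical choice).

    Keeps the most probable word in every bin and drops any other
    alternative whose posterior falls below min_posterior.
    """
    filtered: WCN = []
    for alternatives in wcn:
        if not alternatives:
            continue  # skip empty bins rather than fail
        ranked = sorted(alternatives, key=lambda wp: wp[1], reverse=True)
        kept = [ranked[0]] + [wp for wp in ranked[1:] if wp[1] >= min_posterior]
        filtered.append(kept)
    return filtered


def serialize_wcn(wcn: WCN, sep: str = "/") -> str:
    """Flatten a WCN to text: alternatives joined by sep, bins by spaces."""
    return " ".join(sep.join(word for word, _ in bin_) for bin_ in wcn)


def build_prompt(wcn: WCN, task_instruction: str, min_posterior: float = 0.3) -> str:
    """Assemble a prompt with a WCN instruction (wording is illustrative)."""
    wcn_instruction = (
        "The transcript comes from a speech recognizer. "
        "Words separated by '/' are alternative hypotheses for the same "
        "position; at most one of them was actually spoken."
    )
    transcript = serialize_wcn(filter_wcn(wcn, min_posterior))
    return f"{wcn_instruction}\n\nTranscript: {transcript}\n\n{task_instruction}"


if __name__ == "__main__":
    # Toy WCN with made-up posteriors, for illustration only.
    example_wcn: WCN = [
        [("what", 0.9), ("watt", 0.1)],
        [("is", 1.0)],
        [("the", 0.95), ("a", 0.05)],
        [("whether", 0.55), ("weather", 0.45)],
    ]
    print(build_prompt(example_wcn, "Classify the intent of the utterance."))
```

In this toy example, the low-posterior alternatives "watt" and "a" are filtered out, while the genuinely ambiguous pair "whether/weather" is retained, so the LLM sees the uncertainty only where the recognizer was actually unsure.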
