Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over
Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix
TL;DR
This work evaluates decoder-based LLMs for rescoring N-best ASR hypotheses of casual conversations using the CHiME-7 DASR dataset. It systematically compares Llama2-7B with a small Transformer LM (Slama2-70M) under varying context lengths (up to 1024 tokens) and with memory-efficient domain adaptation via QLoRA. Key findings show that Llama2 yields lower WER than the domain-adapted baseline even without adaptation, and that longer contexts improve accuracy by capturing the conversational flow; domain adaptation shortens the context length Llama2 needs, lowering compute. The results offer practical guidance for deploying LLM-based rescoring in casual-speech scenarios and point to future work with larger LLMs and backward LMs to push performance further.
Abstract
Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we investigate this by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LLMs, and the CHiME-7 DASR task provides datasets of casual conversations between multiple participants. We investigate the effects of domain adaptation of the LLM and of context carry-over when performing N-best rescoring. Experimental results show that, even without domain adaptation, Llama2 outperforms a standard-size domain-adapted Transformer-LM, especially when using a long context. Domain adaptation shortens the context length Llama2 needs to achieve its best performance, i.e., it reduces the computational cost of Llama2.
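As a rough sketch of the general technique (not the paper's actual implementation), N-best rescoring typically combines each hypothesis's ASR score with an LM score log-linearly and picks the best-scoring hypothesis. The function names, the toy LM, and the interpolation weight below are illustrative assumptions:

```python
def rescore_nbest(hypotheses, lm_score_fn, lm_weight=0.5):
    """Rerank N-best ASR hypotheses by log-linear score interpolation.

    hypotheses: list of (text, asr_logprob) pairs from the ASR decoder.
    lm_score_fn: maps a hypothesis text to an LM log-probability
        (in practice, from an LLM such as Llama2, optionally conditioned
        on preceding context for context carry-over).
    lm_weight: interpolation weight, tuned on development data
        (0.5 is a hypothetical value).
    """
    rescored = [
        (text, asr_lp + lm_weight * lm_score_fn(text))
        for text, asr_lp in hypotheses
    ]
    # Return the hypothesis with the highest combined score.
    return max(rescored, key=lambda pair: pair[1])[0]


# Toy example: a stand-in LM table instead of a real LLM.
toy_lm = {"i scream": -8.0, "ice cream": -2.0}
best = rescore_nbest(
    [("i scream", -1.0), ("ice cream", -1.2)],
    lambda text: toy_lm[text],
)
print(best)  # → ice cream
```

With context carry-over, `lm_score_fn` would score each hypothesis appended to the preceding (recognized) conversation, so the LM probability reflects the dialogue flow rather than the utterance in isolation.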
