Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions

Jiwon Suh, Injae Na, Woohwan Jung

TL;DR

This paper tackles the challenge of domain-specific word recognition in end-to-end ASR by leveraging the Whisper model without architectural changes and introducing textual descriptions as prompts to the decoder. It proposes decoder-only fine-tuning and a context perturbation strategy to maintain generalization while exploiting descriptions, and it uses LLM-generated descriptions when human descriptions are unavailable. Experiments on Earnings Call and MIT OCW datasets show that decoder-focused training plus context perturbation improves domain-specific Word Error Rate, with LLM-generated descriptions often outperforming collected ones due to more targeted content. The approach avoids additional modules, reduces data requirements, and demonstrates practical gains in niche domains by effectively integrating textual context through prompting and lightweight fine-tuning.

Abstract

End-to-end automatic speech recognition (E2E ASR) systems have significantly improved speech recognition through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terms. To address this problem, we propose a method that uses the state-of-the-art Whisper model without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, we propose two additional training techniques to improve domain-specific ASR: decoder fine-tuning and context perturbation. We also propose a method that uses a Large Language Model (LLM) to generate descriptions from simple metadata when descriptions are unavailable. Our experiments demonstrate that the proposed methods notably enhance domain-specific ASR accuracy on real-life datasets, with LLM-generated descriptions outperforming human-crafted ones in effectiveness.
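
The abstract names context perturbation but does not define it. As a rough illustration only, the sketch below assumes it means occasionally pairing a training sample with a description taken from a different sample, so that the model does not over-rely on the prompt. The Sample type, the perturb_descriptions helper, and the swap_prob parameter are hypothetical names introduced here, not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    audio_id: str      # hypothetical identifier for an audio clip
    description: str   # textual description used as the decoder prompt
    transcript: str    # reference transcript for the clip


def perturb_descriptions(samples, swap_prob=0.1, seed=0):
    """Return copies of `samples` in which some descriptions are replaced
    by the description of another, randomly chosen sample.

    Assumption: this is one plausible reading of "context perturbation";
    the swap probability is an illustrative value, not one from the paper.
    """
    rng = random.Random(seed)
    perturbed = []
    for i, sample in enumerate(samples):
        description = sample.description
        if len(samples) > 1 and rng.random() < swap_prob:
            # Choose a different sample's description as mismatched context.
            j = rng.choice([k for k in range(len(samples)) if k != i])
            description = samples[j].description
        perturbed.append(Sample(sample.audio_id, description, sample.transcript))
    return perturbed
```

Under this assumption, the perturbed list would replace the original training pairs for an epoch, with transcripts left untouched so that only the prompt side of each example changes.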

Paper Structure (12 sections, 1 figure, 4 tables)