Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

Yosuke Higuchi; Tetsuji Ogawa; Tetsunori Kobayashi

TL;DR

The paper improves end-to-end ASR by using an instruction-tuned LLM to provide linguistic guidance during decoding. It introduces an LLM-guided decoder that prompts the LLM, zero-shot, to correct grammatical errors in a CTC hypothesis, then fuses the resulting linguistic information with the encoder's acoustic representations through cross-attention. Training proceeds in two stages: first a joint CTC/attention ASR model is trained, then the encoder is frozen and the LLM-guided decoder is trained with Llama2. The approach yields about a 13% relative reduction in word error rate across major benchmarks, with additional gains on data where unnormalized text proves beneficial. A notable limitation is the high computational cost, which motivates future work on lightweight, efficient LLMs that preserve these gains while remaining practical.
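To make the "relative reduction" figure concrete, the following is a minimal sketch of how a relative word error rate reduction is usually computed. The function name and the two WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which WER drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent (NOT taken from the paper): baseline vs. improved model.
reduction = relative_wer_reduction(10.0, 8.7)
print(f"relative WER reduction: {reduction:.1%}")
```

Under these assumed numbers, an absolute drop of 1.3 points on a 10.0% baseline corresponds to a relative reduction of about 13%.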

Abstract

We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). Modern LLMs are adept at performing various text generation tasks through zero-shot learning, prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and fed into the LLM along with a specific instruction. The decoder then takes the LLM output as input to perform token predictions, combining acoustic information from the encoder with the powerful linguistic information provided by the LLM. Experimental results show that the proposed LLM-guided model achieves a relative improvement of approximately 13% in word error rate across major benchmarks.
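The first two steps described above (obtaining a hypothesis via CTC decoding and pairing it with an instruction) can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: it uses greedy (best-path) CTC decoding, word-level tokens instead of subword units for readability, and a hypothetical instruction wording, since the exact prompt is not given in this text. The LLM call itself and the cross-attention decoder are omitted.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_greedy_collapse(frame_tokens, blank=BLANK):
    """Standard CTC best-path post-processing: merge consecutive repeats, then drop blanks."""
    collapsed = [token for token, _ in groupby(frame_tokens)]
    return [token for token in collapsed if token != blank]


def build_gec_prompt(hypothesis):
    """Wrap an ASR hypothesis in a grammatical-error-correction instruction.

    The instruction wording is an assumption made for illustration only.
    """
    instruction = "Correct the grammatical errors in the following sentence."
    return f"{instruction}\n\nInput: {hypothesis}\nOutput:"


# Per-frame argmax tokens from a CTC output layer (made-up example data).
frames = [BLANK, "i", "i", BLANK, "has", "has", BLANK, BLANK, "a", "cat", "cat"]

hypothesis = " ".join(ctc_greedy_collapse(frames))
prompt = build_gec_prompt(hypothesis)
print(prompt)
```

In the model described in the abstract, the LLM is used as a feature extractor, so a prompt of this kind would be passed through the LLM to obtain representations for the decoder rather than to generate a corrected sentence directly.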
