Efficient Streaming LLM for Speech Recognition

Junteng Jia, Gil Keren, Wei Zhou, Egor Lakomkin, Xiaohui Zhang, Chunyang Wu, Frank Seide, Jay Mahadeokar, Ozlem Kalinli

TL;DR

SpeechLLM-XL is a linear-scaling, decoder-only model for streaming speech recognition. It processes audio in configurable chunks with a limited attention window to reduce computation, and it generates the text tokens for each audio chunk auto-regressively until an EOS token is predicted.

Abstract

Recent work has shown that prompting large language models with audio encodings can unlock speech recognition capabilities. However, existing techniques do not scale efficiently, especially when handling long-form streaming audio inputs: they extrapolate poorly beyond the audio lengths seen during training, and they are computationally inefficient because of the quadratic cost of attention. In this work, we introduce SpeechLLM-XL, a linear-scaling, decoder-only model for streaming speech recognition. We process audio in configurable chunks with a limited attention window to reduce computation, and the text tokens for each audio chunk are generated auto-regressively until an EOS token is predicted. During training, the transcript is segmented into chunks using a CTC forced alignment estimated from the encoder output. SpeechLLM-XL with a 1.28-second chunk size achieves 2.7%/6.7% WER on LibriSpeech test-clean/test-other, and it shows no quality degradation on long-form utterances 10x longer than the training utterances.
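The chunk-wise decoding described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stream_decode`, `toy_next_token`, the `left_chunks` parameter, the `max_tokens_per_chunk` cap, and the choice to leave the EOS token out of the stored context are all hypothetical. The toy "model" simply emits the uppercase form of each vowel frame in the current chunk, standing in for a real LLM decoder.

```python
from typing import Callable, List, Sequence

EOS = "$"  # end-of-chunk marker (symbol chosen for illustration)


def stream_decode(
    audio_chunks: Sequence[Sequence[str]],
    next_token: Callable[[List[str]], str],
    left_chunks: int = 1,
    max_tokens_per_chunk: int = 32,  # hypothetical safety cap
) -> List[str]:
    """Greedy chunk-wise decoding with a bounded context window."""
    history: List[List[str]] = []  # per chunk: audio encodings + emitted text
    transcript: List[str] = []
    for chunk in audio_chunks:
        current = list(chunk)
        for _ in range(max_tokens_per_chunk):
            # Context = the last `left_chunks` finished chunks + current chunk.
            start = max(0, len(history) - left_chunks)
            context = [x for seg in history[start:] for x in seg] + current
            token = next_token(context)
            if token == EOS:  # the model signals that this chunk is done
                break
            current.append(token)
            transcript.append(token)
        history.append(current)
    return transcript


def toy_next_token(context: List[str]) -> str:
    """Stand-in for the LLM: emit the uppercase of each vowel frame, then EOS."""
    i = len(context)
    while i > 0 and context[i - 1].isupper():  # text already emitted for this chunk
        i -= 1
    j = i
    while j > 0 and context[j - 1].islower():  # audio frames of the current chunk
        j -= 1
    emitted, frames = context[i:], context[j:i]
    targets = [f.upper() for f in frames if f in "aeiou"]
    return targets[len(emitted)] if len(emitted) < len(targets) else EOS


print(stream_decode(["abcd", "efgh", "ijkl"], toy_next_token))  # ['A', 'E', 'I']
```

Because the context never holds more than `left_chunks + 1` chunks, the work per chunk stays bounded and the total cost grows linearly with the number of chunks, which is the scaling property the abstract claims.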

Paper Structure (9 sections, 5 equations, 1 figure, 5 tables)


Figures (1)

  • Figure 1: Overview of the proposed model. (A) SpeechLLM-XL consists of an audio encoder, an LLM decoder, and a text embedding layer. The audio sequence is processed in fixed-length chunks, and the resulting audio encodings (denoted by letters) are interleaved with text embeddings (denoted by numbers) according to the audio-text alignment; the entire sequence is then fed into the LLM. The model is trained with next-token prediction to generate the text tokens for each chunk, plus an EOS token "$" indicating the end of the chunk. (B) We use a limited attention window in the LLM decoder to reduce computation. In this plot, the audio/text encodings in each chunk attend only to the current chunk and the one chunk before it (i.e. token 4 would attend to $\{a, b, c, d, 1, e, f, g, h, 2, 3, 4\}$). (C) During training, the audio-text alignment is computed by a CTC forced aligner that aligns audio encodings with text tokens.
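The interleaving in panel (A) and the limited attention window in panel (B) can be illustrated with a small sketch. The function names `interleave` and `chunk_window_mask` and the `left_chunks` parameter are hypothetical, the third chunk (`ijkl`, `5`) is invented to show the window sliding, and EOS tokens are left out so that the attended set matches the one listed in the caption.

```python
from typing import List, Sequence, Tuple


def interleave(
    audio_chunks: Sequence[Sequence[str]],
    text_chunks: Sequence[Sequence[str]],
) -> Tuple[List[str], List[int]]:
    """Place each chunk's audio encodings first, then its aligned text tokens."""
    tokens: List[str] = []
    chunk_ids: List[int] = []
    for k, (audio, text) in enumerate(zip(audio_chunks, text_chunks)):
        for item in list(audio) + list(text):
            tokens.append(item)
            chunk_ids.append(k)
    return tokens, chunk_ids


def chunk_window_mask(chunk_ids: Sequence[int], left_chunks: int = 1) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    A position attends causally (j <= i) and only to positions whose chunk
    is the current one or one of the `left_chunks` chunks before it.
    """
    n = len(chunk_ids)
    return [
        [j <= i and 0 <= chunk_ids[i] - chunk_ids[j] <= left_chunks for j in range(n)]
        for i in range(n)
    ]


tokens, ids = interleave(["abcd", "efgh", "ijkl"], [["1"], ["2", "3", "4"], ["5"]])
mask = chunk_window_mask(ids, left_chunks=1)

row = tokens.index("4")
print([t for t, ok in zip(tokens, mask[row]) if ok])
# ['a', 'b', 'c', 'd', '1', 'e', 'f', 'g', 'h', '2', '3', '4']
```

In this sketch, token `5` in the invented third chunk attends only to `e` through `5`: the first chunk has fallen out of the window, so each row of the mask has a bounded number of allowed positions regardless of total sequence length.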