Comparing Discrete and Continuous Space LLMs for Speech Recognition
Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu
TL;DR
This work systematically compares discrete and continuous speech representations in LLM-based automatic speech recognition (ASR). It organizes the representations into four categories by supervision and data form, and evaluates them with two language-model backbones: a Joint-Training-From-Scratch Language Model (JTFS LM) and the pretrained LLaMA2-7b. Using specialized encoders (HuBERT and Whisper) and a range of modeling and training strategies, the study shows that continuous representations generally yield lower WER than discrete ones under JTFS LM, and that LLaMA2 with TextInput strategies reaches state-of-the-art results among open-source systems on LibriSpeech (best 1.69% WER on clean and around 3.0% on other). The key contributions are the first thorough cross-comparison of discrete and continuous speech representations in LLM-based ASR, a detailed modeling blueprint for each combination, and open-source results for ASR and NLP research to build on. The findings underscore the importance of representation fidelity, encoder choice, and pretraining for integrating speech with large language models, with practical implications for building cost-efficient yet accurate speech interfaces.
Abstract
This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs into continuous-space and discrete-space models according to their input and autoregressive feedback. Using specialized encoders and a comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We report an open-source, state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
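Because the results here are reported as Word Error Rate, it may help to recall how that metric is usually computed: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no casing or punctuation normalization is applied here).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 1.69% corresponds to roughly 17 word errors per 1,000 reference words.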
