Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie
TL;DR
This pilot study investigates streaming decoder-only automatic speech recognition (ASR) with discrete speech units. It introduces boundary token insertion (BTI) and right-chunk attention to enable low-latency streaming while preserving contextual modeling. Compared with Text Token Insertion (TTI), BTI yields a lower character error rate (CER) on AISHELL-1 (5.9% with LLM initialization) and AISHELL-2 (7.2% without speed perturbation), approaching non-streaming decoder-only baselines. The study also shows that data augmentation, especially label smoothing, and initialization from off-the-shelf LLMs further improve performance. Overall, the results support streaming decoder-only ASR with discrete units as a viable approach; future work targets larger models, more languages, and alternative speech tokenizers.
Abstract
Unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially Automatic Speech Recognition (ASR). These models typically represent discrete speech tokens and text tokens in a unified way and train a decoder-only transformer on them. However, they are all designed for non-streaming ASR, where the entire speech utterance must be available during decoding. We therefore introduce a decoder-only model designed specifically for streaming recognition. It incorporates a dedicated boundary token to support streaming recognition and is trained with causal attention masking. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. Experiments on the AISHELL-1 and AISHELL-2 datasets show that our streaming approach performs competitively with non-streaming decoder-only counterparts.
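To make the two masking ideas concrete, the following is a minimal sketch of how a causal attention mask can be relaxed to allow a bounded amount of right context. The abstract does not define right-chunk attention precisely, so the rule used here (each position may also attend to the remaining positions of its own fixed-size chunk) is an assumption, and the names `chunk_lookahead_mask`, `causal_mask`, and `chunk_size` are hypothetical rather than taken from the paper.

```python
def chunk_lookahead_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    """Return a mask where mask[i][j] is True if position i may attend to j.

    Assumed rule (not taken from the paper): position i sees every earlier
    position plus the remaining positions of its own chunk.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    mask = []
    for i in range(seq_len):
        # Last index of the chunk containing position i, clipped to the sequence.
        chunk_end = min((i // chunk_size + 1) * chunk_size, seq_len) - 1
        mask.append([j <= chunk_end for j in range(seq_len)])
    return mask


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Plain causal mask: a chunk size of 1 removes all right context."""
    return chunk_lookahead_mask(seq_len, chunk_size=1)


if __name__ == "__main__":
    # Print the mask for 5 positions and chunks of 2 ("x" = may attend).
    for row in chunk_lookahead_mask(5, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, the lookahead never exceeds `chunk_size - 1` positions, which is what keeps the added latency bounded while still giving each position some future context.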
