Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie

TL;DR

This work presents a pilot study on streaming decoder-only automatic speech recognition using discrete speech units. It introduces boundary token insertion (BTI) and right-chunk attention to enable low-latency streaming while maintaining contextual modeling. It compares BTI with Text Token Insertion (TTI) and shows that BTI yields a lower character error rate (CER) on AISHELL-1 (reaching 5.9% with LLM initialization) and AISHELL-2 (7.2% without speed perturbation), approaching non-streaming decoder-only baselines. The study also shows that data augmentation, especially label smoothing, and initialization from off-the-shelf LLMs can further improve performance. Overall, the results support streaming decoder-only ASR with discrete units as a viable approach, with future work aimed at larger models, more languages, and alternative speech tokenizers.

Abstract

Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts.

Paper Structure

This paper contains 11 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of different methods for training a discrete-token-based decoder-only Transformer for ASR. (a) Non-streaming model: decoding starts after the whole speech token sequence has been received; (b) Text token insertion (TTI) streaming model: inserts text tokens directly into the speech token sequence, guided by the speech-to-text alignment; (c) Boundary token insertion (BTI) streaming model: inserts "boundary tokens" into the discrete speech token sequence.
  • Figure 2: Example diagram of different attention mechanisms. The green blocks indicate the LLM. The yellow triangles indicate the attention area. (a) global attention; (b) causal attention; (c) right-chunk attention.
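
The three attention patterns named in Figure 2 can be illustrated as boolean masks. The following is a minimal sketch, not the authors' implementation: the page does not define right-chunk attention precisely, so the sketch assumes it means causal attention plus a fixed look-ahead of `right_chunk` future positions. The function names and the `right_chunk` parameter are hypothetical.

```python
import numpy as np

# Convention used here: mask[i, j] is True when query position i
# is allowed to attend to key position j.

def global_mask(n: int) -> np.ndarray:
    """Global attention: every position sees the whole sequence."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    """Causal attention: position i sees positions 0..i only."""
    return np.tril(np.ones((n, n), dtype=bool))

def right_chunk_mask(n: int, right_chunk: int) -> np.ndarray:
    """ASSUMED definition of right-chunk attention: position i sees
    positions 0..i plus up to `right_chunk` future positions."""
    idx = np.arange(n)
    return idx[None, :] <= idx[:, None] + right_chunk

if __name__ == "__main__":
    # Print the assumed right-chunk pattern for a short sequence.
    print(right_chunk_mask(4, 1).astype(int))
```

Under this assumed definition, `right_chunk=0` reduces to the causal mask and any `right_chunk` of at least `n - 1` reduces to the global mask, so the look-ahead size trades latency against right context.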