Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang

TL;DR

The paper addresses a limitation of Loss Masking in decoder-only, discrete-token ASR for unified speech-text models: masking the loss on speech tokens ignores the dependencies among them. It proposes Smoothed Label Distillation (SLD), which adds a KL-divergence term with smoothed speech-label targets to the standard multimodal CE objective. SLD models dependencies among speech tokens explicitly, using smoothed targets $q'(x_t \mid x_{<t})$ and the final loss $\mathcal{L}_{\text{SLD}} = \mathcal{L}_{\text{CE\_text}} + \mathcal{L}_{\text{CE\_speech}} + \alpha \mathcal{L}_{\text{KL\_speech}}$. Experiments on LibriSpeech with different discretizations (WavLM/HuBERT) show that SLD consistently improves ASR performance, with relative WER reductions of up to 9% on test-clean and 4% on test-other, and that it generally outperforms both Loss Masking and Multimodal CE. The approach is robust to discretization noise and shows potential for improving decoder-only, discrete-token-based ASR and related unified speech-text tasks.
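
To make the loss above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the smoothed target $q'$ is built by uniform label smoothing with a hypothetical parameter `epsilon`, that the KL term is taken as KL(q' || p) with p the model's softmax output, and that per-token losses are summed rather than averaged. The function names and default values are illustrative only.

```python
import math

def smoothed_target(label, vocab_size, epsilon=0.1):
    """Assumed form of q': 1 - epsilon on the reference token,
    epsilon spread evenly over the other vocab_size - 1 tokens."""
    off = epsilon / (vocab_size - 1)
    return [1.0 - epsilon if k == label else off for k in range(vocab_size)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy against a hard (one-hot) label."""
    return -math.log(probs[label])

def kl_divergence(target, probs):
    """KL(target || probs); entries with zero target mass contribute nothing."""
    return sum(q * math.log(q / p) for q, p in zip(target, probs) if q > 0.0)

def sld_loss(speech_steps, text_steps, alpha=1.0, epsilon=0.1):
    """L_SLD = L_CE_text + L_CE_speech + alpha * L_KL_speech.

    speech_steps and text_steps are lists of (logits, label) pairs,
    one pair per predicted token position.
    """
    ce_text = sum(cross_entropy(softmax(l), y) for l, y in text_steps)
    ce_speech = sum(cross_entropy(softmax(l), y) for l, y in speech_steps)
    kl_speech = sum(
        kl_divergence(smoothed_target(y, len(l), epsilon), softmax(l))
        for l, y in speech_steps
    )
    return ce_text + ce_speech + alpha * kl_speech
```

With `alpha=0.0` the function reduces to the multimodal CE objective, so the KL term can be studied in isolation by varying `alpha`.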

Abstract

Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld

Paper Structure (10 sections, 6 equations, 3 figures, 1 table)

This paper contains 10 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The comparison of different methods for training discrete-token-based decoder-only Transformer for ASR. (1) Loss Masking masks the loss for speech tokens and uses cross-entropy loss for text tokens; (2) Multimodal Cross-Entropy Loss uses cross-entropy loss for both speech and text tokens, based on the outputs of the model and the hard labels; (3) Our proposed Smoothed Label Distillation (SLD) adds a KL divergence loss between the model outputs and the smoothed labels for speech tokens, on top of the Multimodal Cross-Entropy Loss.
  • Figure 2: The relationship between the WER of the dev-clean and dev-other datasets and the varying weight parameter $\alpha$ for the KL divergence loss component.
  • Figure 3: Comparison of loss curves for $\mathcal{L}_{\text{CE\_text}}$ on dev-other.
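
The three training schemes compared in Figure 1 differ only in which token positions contribute to the loss. The sketch below illustrates that difference under simplifying assumptions: per-token cross-entropy and KL values are taken as already computed, losses are summed over positions, and the helper name `compare_objectives` and its arguments are hypothetical rather than taken from the paper's code.

```python
def compare_objectives(ce, kl, is_speech, alpha=1.0):
    """Combine per-token losses under the three schemes of Figure 1.

    ce        : per-token cross-entropy values (hard labels)
    kl        : per-token KL values against smoothed labels
                (only speech positions are used)
    is_speech : True for speech-token positions, False for text tokens
    alpha     : weight of the KL term in SLD
    """
    ce_text = sum(c for c, s in zip(ce, is_speech) if not s)
    ce_speech = sum(c for c, s in zip(ce, is_speech) if s)
    kl_speech = sum(k for k, s in zip(kl, is_speech) if s)
    return {
        # (1) Loss Masking: speech positions are masked out of the loss.
        "loss_masking": ce_text,
        # (2) Multimodal CE: speech and text positions both use CE.
        "multimodal_ce": ce_text + ce_speech,
        # (3) SLD: Multimodal CE plus a weighted KL term on speech positions.
        "sld": ce_text + ce_speech + alpha * kl_speech,
    }

# Toy sequence: two speech tokens followed by two text tokens.
losses = compare_objectives(
    ce=[2.0, 1.5, 0.5, 0.25],
    kl=[0.5, 0.25, 0.0, 0.0],
    is_speech=[True, True, False, False],
    alpha=0.5,
)
```

In this toy example, Loss Masking sees only the two text positions, while the other two objectives also receive a training signal from the speech positions.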