A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

Dingdong Wang, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng

TL;DR

A fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM reveals that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding.

Abstract

With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate seamlessly with text-based tokens. Most studies, however, focus on continuous speech features, and although discrete-token-based LLMs have shown promising results on certain tasks, the performance gap between the two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the underperformance of discrete tokens, such as limited token granularity and inefficient information retention. Based on this analysis, we explore potential directions for improving the performance of discrete tokens. We hope our results offer new insights into the opportunities for advancing discrete speech tokens in Speech LLMs.

Paper Structure

This paper contains 15 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Architectures of two approaches for integrating speech into Large Language Models (LLMs): discrete token-based encoding versus continuous feature processing.
  • Figure 2: Total training time until convergence for discrete tokens and continuous features, with discrete token training time normalized to 1 for all datasets.
  • Figure 3: Frequency distribution of discrete tokens with a codebook size of 6000, based on the GigaSpeech M-size corpus. The red line indicates the 95% cumulative frequency threshold.
  • Figure 4: WER results for different layers, using Qwen1.5-0.5B on the LibriSpeech 100-hour dataset.
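
The 95% cumulative frequency threshold mentioned for Figure 3 can be illustrated with a minimal sketch. It is not the authors' code: the function name `tokens_covering` and the toy token stream are hypothetical, and the sketch simply assumes the threshold means "the smallest number of most-frequent token types whose occurrences account for 95% of all token occurrences".

```python
from collections import Counter


def tokens_covering(token_ids, coverage=0.95):
    """Count how many of the most frequent token types are needed
    to account for at least `coverage` of all token occurrences.

    `token_ids` is any iterable of hashable token IDs (hypothetical input;
    in practice this would be the discrete-token transcript of a corpus).
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    if total == 0:
        return 0

    running = 0
    # Walk token types from most to least frequent, accumulating their share.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return rank
    return len(counts)


# Toy usage: a skewed stream over a tiny "codebook" of four token types.
stream = [0] * 6 + [1] * 3 + [2] * 1
needed = tokens_covering(stream, coverage=0.9)
print(f"{needed} of {len(set(stream))} token types cover 90% of occurrences")
```

A small count relative to the codebook size would indicate that a few tokens dominate the distribution, which is the kind of skew a cumulative-frequency line in such a plot is meant to expose.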