A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

Dingdong Wang; Mingyu Cui; Dongchao Yang; Xueyuan Chen; Helen Meng

TL;DR

A fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM reveals that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding.

Abstract

With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate seamlessly with text-based tokens. Although discrete-token-based LLMs have shown promising results on certain tasks, most studies focus on continuous speech features, and the performance gap between the two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token granularity and inefficient information retention. Based on this analysis, we explore potential ways to improve the performance of discrete tokens. We hope our results can offer new insights into the opportunities for advancing discrete speech tokens in Speech LLMs.
