Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Hankun Wang, Haoran Wang, Yiwei Guo, Zhihan Li, Chenpeng Du, Kai Yu

TL;DR

The paper investigates why speech language models (SLMs) underperform text LLMs in generating semantically coherent speech. By systematically evolving the input modality from text to phone to speech, it isolates three factors: (A) phonetic versus semantic content, (B) longer speech sequences, and (C) paralinguistic variability. The experiments show that Factor C has the strongest adverse impact, especially on lexical modeling, followed by Factor B, which affects syntax and semantics, while Factor A is comparatively minor. The findings emphasize robust lexical grounding as a prerequisite for higher-level semantics and propose two directions for closing the coherence gap in end-to-end SLMs: shortening sequences and introducing stronger semantic supervision.

Abstract

Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) speech sequences are much longer than text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of these three factors separately by transitioning the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B has a more noticeable influence on syntactic and semantic modeling, and factor C exerts the most significant impact, particularly on basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.

Paper Structure

This paper contains 23 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Results after training the same number of tokens (within the first epoch).
  • Figure 2: Accuracy results of internal layer outputs for all objective tasks.
  • Figure 3: Layer-wise accuracy changes for the sWUGGY task.