Optimizing Speech Language Models for Acoustic Consistency

Morteza Rohanian, Michael Krauthammer

TL;DR

This study tackles robust and consistent speech generation by extending decoder-only language models to discrete speech units using a fixed tokenizer and frozen codec. The CAST framework introduces SSL-informed initialization of speech tokens, a light alignment loss, and robustness-oriented thinning and auxiliary planning losses to bias the model toward stable acoustic output. Experiments with three CAST variants show that a 0.7B speech-only model achieves the strongest acoustic consistency, while interleaving text and speech improves semantic grounding and alignment at the cost of stability. The work demonstrates that LM-side design and training mix can balance acoustic stability and semantic grounding without architectural changes, offering a scalable path for robust speech-enabled systems.

Abstract

We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic-acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
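As an illustration of what initializing speech tokens with self-supervised features could look like, the sketch below extends a text embedding table with one row per speech code. It is a minimal sketch under stated assumptions, not the authors' procedure: the function name `init_speech_embeddings`, the per-code SSL vectors `ssl_centroids`, the fixed random projection, and the rescaling to the text embedding spread are all hypothetical choices made here for illustration.

```python
import numpy as np

def init_speech_embeddings(text_emb, ssl_centroids, seed=0):
    """Extend a text embedding table with SSL-derived rows for speech codes.

    text_emb:      (V_text, d_model) embeddings of a pretrained text LM.
    ssl_centroids: (V_speech, d_ssl) one SSL feature vector per speech code
                   (assumption: e.g. the mean SSL feature of frames mapped to that code).
    """
    rng = np.random.default_rng(seed)
    d_ssl = ssl_centroids.shape[1]
    d_model = text_emb.shape[1]
    # Assumption: a fixed random projection maps SSL space to model space.
    proj = rng.standard_normal((d_ssl, d_model)) / np.sqrt(d_ssl)
    # Center the SSL vectors so the new rows are not dominated by a shared offset.
    centered = ssl_centroids - ssl_centroids.mean(axis=0, keepdims=True)
    speech_emb = centered @ proj
    # Rescale so the new rows have the same overall spread as the text rows.
    speech_emb *= text_emb.std() / (speech_emb.std() + 1e-8)
    # Text rows come first and are left untouched; speech rows are appended.
    return np.concatenate([text_emb, speech_emb], axis=0)

# Toy data standing in for real embeddings and SSL features.
rng = np.random.default_rng(1)
text_emb = rng.standard_normal((100, 16)) * 0.02
ssl_centroids = rng.standard_normal((32, 24))
table = init_speech_embeddings(text_emb, ssl_centroids)
```

The point of the sketch is only that speech codes with similar SSL features start out with similar embeddings, instead of starting from unstructured random rows.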

Paper Structure

This paper contains 14 sections, 3 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Method overview. (a) Speech (via codec) and text (via BPE) are tokenized and interleaved. (b) Decoder-only LM predicts unified sequence. (c) SSL-initialized speech tokens plus auxiliary coarse and next-code objectives improve modeling. This design forms the CAST approach.
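
Panels (a) and (b) of the figure can be sketched in a few lines: text and speech tokens are placed in one shared ID space, interleaved, and used for next-token prediction. This is a minimal sketch, assuming a speech vocabulary appended after the text vocabulary and one marker token per modality; the constants, marker IDs, and the `interleave` helper are hypothetical and not taken from the paper.

```python
TEXT_VOCAB = 1000               # assumed BPE vocabulary size
NUM_CODES = 500                 # assumed codec codebook size
BOS_TEXT = TEXT_VOCAB + NUM_CODES        # hypothetical text-span marker (1500)
BOS_SPEECH = TEXT_VOCAB + NUM_CODES + 1  # hypothetical speech-span marker (1501)

def interleave(segments):
    """Flatten ("text", ids) / ("speech", codes) segments into one ID sequence."""
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.append(BOS_TEXT)
            seq.extend(ids)
        elif kind == "speech":
            seq.append(BOS_SPEECH)
            # Shift codec codes past the text vocabulary so IDs never collide.
            seq.extend(TEXT_VOCAB + c for c in ids)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq

segments = [("text", [5, 7]), ("speech", [0, 3, 499]), ("text", [9])]
seq = interleave(segments)

# A decoder-only LM is trained to predict each token from the ones before it.
inputs, targets = seq[:-1], seq[1:]
```

A speech-only variant corresponds to passing only `"speech"` segments; the sequence format and the next-token objective stay the same.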