Scaling Speech-Text Pre-training with Synthetic Interleaved Data

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang

TL;DR

The work tackles the data bottleneck in scaling SpeechLMs by generating large-scale synthetic interleaved speech-text data from text corpora using a text-to-token model, paired with a supervised speech tokenizer to produce discrete, semantically meaningful tokens at 12.5 Hz. By pre-training on up to 1 trillion tokens (including 600B interleaved tokens) and then fine-tuning on speech dialogue data, the approach achieves state-of-the-art results in speech language modeling and spoken question answering while enabling end-to-end spoken chatbot capabilities. The method demonstrates that synthetic cross-modal data can substantially bridge text and speech modalities, reducing reliance on parallel corpora and achieving strong cross-domain performance with significantly less natural speech data than prior large-scale baselines. Extensive ablations highlight the impact of data scale, frame rate, and span corruption on performance, underscoring the importance of balancing efficiency and semantic fidelity in synthetic data generation.

Abstract

Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction than text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, so SpeechLMs have not scaled the way LLMs have. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower frame rates (e.g., 12.5 Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question answering tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that performs competitively with existing baselines in both conversational ability and speech quality, even when operating exclusively in the speech domain.
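
The interleaving step described in the abstract (sample text spans, then replace some of them with speech tokens produced by a text-to-token model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `fake_text_to_token`, `interleave`, the `speech_ratio` and `mean_span` parameters, the word-level spans, and the three-tokens-per-word rule are all hypothetical choices made for the example.

```python
import random
from typing import Callable, List, Tuple

# A segment is ("text", [words]) or ("speech", [speech token ids]).
Segment = Tuple[str, list]


def fake_text_to_token(words: List[str]) -> List[int]:
    """Hypothetical stand-in for a trained text-to-token model.

    Emits three deterministic pseudo speech-token ids per word; a real
    model would predict tokens from a learned speech codebook.
    """
    return [(sum(map(ord, w)) + i) % 1024 for w in words for i in range(3)]


def interleave(
    words: List[str],
    speech_ratio: float,
    mean_span: int,
    text_to_token: Callable[[List[str]], List[int]],
    rng: random.Random,
) -> List[Segment]:
    """Split `words` into consecutive spans and convert a random subset
    of those spans into speech-token spans (assumed sampling scheme)."""
    segments: List[Segment] = []
    i = 0
    while i < len(words):
        # Span length is uniform in [1, 2 * mean_span - 1], so its mean is mean_span.
        span_len = rng.randint(1, 2 * mean_span - 1)
        span = words[i:i + span_len]
        if rng.random() < speech_ratio:
            segments.append(("speech", text_to_token(span)))
        else:
            segments.append(("text", span))
        i += len(span)
    return segments


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog".split()
    for kind, payload in interleave(sample, 0.5, 3, fake_text_to_token, random.Random(0)):
        print(kind, payload)
```

Because the speech spans are produced directly as token ids, no waveform is ever synthesized, which is the efficiency argument the abstract makes.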

Paper Structure

This paper contains 43 sections, 1 equation, 3 figures, 13 tables.

Figures (3)

  • Figure 1: (Left) The performance on Spoken QA continuously improves as the amount of synthetic interleaved data increases, significantly surpassing the previous SOTA (Moshi). (Right) The pipeline for synthesizing interleaved speech-text data.
  • Figure 2: Overview of our method. First, we train a text-to-token model to construct interleaved speech-text data. Training of the speech language model has two stages. In stage 1, the model is pre-trained on synthetic interleaved speech-text data. In stage 2, the model is fine-tuned on a speech dialogue dataset.
  • Figure 3: (a) Sampling rate vs average accuracy. (b) Span corruption ratio vs average accuracy. The accuracy is averaged over datasets of speech language modeling and spoken question answering. (c) Interleaved data tokens vs average performance after supervised fine-tuning.
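
To make the trade-off in Figure 3(a) concrete, the short sketch below converts a tokenizer frame rate into a speech-token count. It assumes that "sampling rate" in the figure refers to the tokenizer's frame rate in Hz and that one token is emitted per frame; the function name `tokens_for_duration` and the rates other than 12.5 Hz are illustrative choices, not values taken from the paper.

```python
import math


def tokens_for_duration(seconds: float, frame_rate_hz: float = 12.5) -> int:
    """Number of speech tokens for `seconds` of audio, assuming one token
    per tokenizer frame (an assumption made for this illustration)."""
    return math.floor(seconds * frame_rate_hz)


if __name__ == "__main__":
    # One minute of speech at a few illustrative frame rates.
    for rate in (50.0, 25.0, 12.5):
        print(f"{rate:>5} Hz -> {tokens_for_duration(60, rate)} tokens per minute")
```

At 12.5 Hz, one hour of speech corresponds to 45,000 tokens under these assumptions, so a lower frame rate shortens sequences in direct proportion.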