Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Ju-Chieh Chou, Jiawei Zhou, Karen Livescu

TL;DR

Flow-SLM is a textless spoken language model that jointly generates linguistic content and continuous acoustic information by pairing semantic tokens with a continuous acoustic vector. It uses a conditional flow matching (CFM) objective to predict a velocity field conditioned on the semantic tokens and the preceding context. At inference, an ODE solver integrates this field to produce continuous embeddings, which a vocoder then decodes to a waveform. The approach achieves competitive results on linguistic likelihood benchmarks while improving acoustic fidelity and speaker preservation in prompted generation, using substantially less compute and data than some prior models. Future work will explore scaling, joint speech-text modeling, and a direct comparison with RVQ-based acoustic tokens to better understand the tradeoffs between semantics and acoustics.

Abstract

Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
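The flow-matching objective and ODE-based sampling described above can be illustrated with a minimal sketch. It assumes the commonly used optimal-transport conditional path, $x_t = (1 - (1 - \sigma_{\min})t)\,x_0 + t\,x_1$ with target velocity $u_t = x_1 - (1 - \sigma_{\min})\,x_0$; the source does not spell out this parameterization, so treat it as an assumption. The value of `SIGMA_MIN`, the function names, and the oracle velocity field are hypothetical stand-ins: in the actual model the velocity would come from a learned CFM head conditioned on semantic tokens and context, which is not reproduced here.

```python
import random

SIGMA_MIN = 1e-4  # assumed value; the source does not state one


def ot_path_and_target(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the point x_t on the assumed OT conditional path and its target velocity u_t."""
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(v_pred, u_t):
    """Mean squared error between a predicted velocity and the target velocity."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)


def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(x1, sigma_min=SIGMA_MIN):
    """Hypothetical stand-in for a trained CFM head: the exact conditional
    velocity field that transports any start point toward the single target x1."""
    def velocity(x, t):
        denom = 1.0 - (1.0 - sigma_min) * t
        return [(b - (1.0 - sigma_min) * xi) / denom for xi, b in zip(x, x1)]
    return velocity


# Toy usage: draw Gaussian noise and transport it toward one target vector.
random.seed(0)
target = [0.5, 3.0]                            # illustrative "acoustic frame"
noise = [random.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(make_oracle_velocity(target), noise, num_steps=8)
```

Because the oracle field is constant along each straight-line path, a handful of Euler steps is enough in this toy case. A learned field generally is not exactly straight, so the number of solver steps becomes a quality/compute tradeoff.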

Paper Structure

This paper contains 22 sections, 11 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Left: The Flow-SLM architecture (see Sec. \ref{sec:fm}). The encoder maps the waveform into semantic tokens $z$ and continuous embeddings $x$. The causal transformer maps the sequence of embeddings $x$ into a context vector $c$. At each timestep $m$, the token predictor predicts the semantic tokens $z_{m:m+k-1}$ given the context vector $c_{<m}$. The CFM head predicts the vector field $v_t$ of the optimal-transport conditional flow $\phi^{OT}_{m,t}$ (Sec. \ref{sec:fm}), conditioned on the semantic tokens $z_{m:m+k-1}$ and the context vector $c_{<m}$. Right: The per-timestep inference process for the CFM head. At each timestep, an ODE solver generates the embedding for that timestep: it repeatedly takes the current $x_t$ and the velocity $v_t$ from the CFM head and produces the next $x_{t+\Delta t}$. Once the whole embedding sequence has been produced, the vocoder decodes it to a waveform.
  • Figure 2: Compute and data comparison across models. Following Hoffmann et al. (2022), we estimate the theoretical FLOPs as $6 \cdot N_{\text{param}} \cdot D_{\text{tokens}}$, where $N_{\text{param}}$ is the number of trainable parameters excluding embeddings and $D_{\text{tokens}}$ is the number of training tokens. For SpiritLM, we count the total amount of speech data, including both speech-only and speech-text data.
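
The FLOPs estimate used in Figure 2 (6 times the non-embedding parameter count times the number of training tokens) is simple enough to sketch directly. The function name and the example numbers below are hypothetical and chosen only for illustration; they are not figures reported for any model in the comparison.

```python
def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Theoretical training FLOPs under the 6 * N * D approximation.

    n_params: number of trainable parameters, excluding embeddings.
    n_tokens: number of training tokens.
    """
    if n_params < 0 or n_tokens < 0:
        raise ValueError("parameter and token counts must be non-negative")
    return 6.0 * n_params * n_tokens


# Hypothetical example: a 1e9-parameter model trained on 1e11 tokens.
example_flops = estimate_training_flops(1e9, 1e11)
```

The estimate is linear in both factors, so halving either the parameter count or the token count halves the estimated compute.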