Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Ju-Chieh Chou; Jiawei Zhou; Karen Livescu

TL;DR

Flow-SLM is a textless spoken language model that jointly generates linguistic content and continuous acoustic information by pairing semantic tokens with a continuous acoustic vector. It uses a conditional flow matching (CFM) objective to predict the velocity field conditioned on the semantic tokens and historical context, enabling ODE-based waveform generation without a separate vocoder-only pipeline. The approach achieves competitive results on linguistic likelihood benchmarks while delivering improved acoustic fidelity and speaker preservation, using substantially less compute and data than some prior models. Future work will explore scaling, joint speech-text modeling, and direct comparison with RVQ-based acoustic tokens to better understand the tradeoffs between semantics and acoustics.

Abstract

Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
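
To make the flow-matching objective mentioned above more concrete, the following is a minimal sketch of generic conditional flow matching in plain Python. It is not the authors' implementation. It assumes a linear interpolation path between a noise sample x0 and a target acoustic vector x1, which this section does not confirm as the path used in Flow-SLM. The names `cfm_training_pair`, `cfm_loss`, `euler_sample`, `velocity_fn`, and `cond` are hypothetical; `cond` stands in for whatever conditioning (semantic tokens, history) a real model would receive.

```python
def cfm_training_pair(x0, x1, t):
    """Build one training example for flow matching.

    Assumption: a linear path x_t = (1 - t) * x0 + t * x1, whose
    target velocity is the constant x1 - x0.
    x0: noise sample, x1: target acoustic vector, t: time in [0, 1].
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)) / n


def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps.

    velocity_fn is a hypothetical stand-in for a trained network that
    predicts the velocity given the current point, the time, and the
    conditioning information.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a model's velocity prediction at `x_t` would be compared to `target_velocity` with `cfm_loss`; at generation time, `euler_sample` starts from noise and follows the predicted velocities toward an acoustic vector. The step count and the Euler solver are illustrative choices only.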
