On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
TL;DR
This survey maps the rapid emergence of universal spoken language processing by organizing spoken language models (SLMs) into three core classes: pure speech LMs, speech-aware text LMs, and speech+text LMs. It provides a unified formulation of SLM architecture, detailing components such as speech encoders, modality adapters, and sequence models, and compares tokenization schemes (phonetic vs. audio codec) and generation strategies (hierarchical, interleaved, and text-speech hybrids). It covers training paradigms (generative pre-training, conditional pre-training, continual pre-training, instruction tuning, and chat-style post-training) along with post-training techniques such as parameter-efficient fine-tuning (PEFT) and preference optimization, and surveys representative models and duplex dialogue approaches. The paper also reviews evaluation frameworks (likelihood-based, generative, and trustworthiness metrics) and highlights major challenges in architecture, data, evaluation, openness, and safety, outlining directions for building robust, inclusive, and scalable SLMs. Overall, the work clarifies the current landscape, consolidates terminology, and identifies key barriers and opportunities on the path to universal, instruction-following spoken language systems.
Abstract
The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) that act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.
