Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Ryuichiro Higashinaka

TL;DR

Llama-Mimi flattens the multi-level RVQ tokens produced by the Mimi neural audio codec into a single sequence and models them autoregressively with a Transformer decoder. It outperforms a CSM-based hierarchical model on most tasks and achieves the best performance on acoustic consistency.

Abstract

Speech Language Models (SpeechLMs) model tokenized speech to capture both semantic and acoustic information. When neural audio codecs based on Residual Vector Quantization (RVQ) are used as audio tokenizers, they produce multiple discrete tokens per time step, yielding inherently multi-level representations. To process these multi-level tokens together, prior work typically adopts hierarchical architectures to capture this structure. In contrast, recent progress in NLP has progressively reduced architectural inductive biases, moving toward simpler and more scalable single-Transformer architectures. In this work, we propose Llama-Mimi, which flattens multi-level RVQ tokens produced by the Mimi neural audio codec into a single sequence and models them autoregressively with a Transformer decoder. We show that Llama-Mimi outperforms a CSM-based hierarchical model on most tasks and achieves the best performance on acoustic consistency. Our models, code, and speech samples are publicly available.
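
As a rough illustration of the flattening step described above, the following minimal sketch turns a grid of RVQ codes (one row per time step, one column per quantizer level) into a single token sequence that a decoder-only Transformer could consume. The function names, the frame-major ordering, and the per-level vocabulary offsets are assumptions made for illustration; they are not taken from the authors' released code.

```python
import numpy as np


def flatten_rvq(codes: np.ndarray, codebook_size: int) -> np.ndarray:
    """Flatten a (num_frames, num_levels) grid of RVQ codes into one sequence.

    Assumption: tokens are emitted frame by frame, and within each frame
    from the first quantizer level to the last. Each level is shifted into
    its own id range so the same code index at different levels stays
    distinguishable in a shared vocabulary.
    """
    num_frames, num_levels = codes.shape
    offsets = np.arange(num_levels) * codebook_size  # shape (num_levels,)
    return (codes + offsets).reshape(-1)             # shape (num_frames * num_levels,)


def unflatten_rvq(flat: np.ndarray, num_levels: int, codebook_size: int) -> np.ndarray:
    """Invert flatten_rvq: recover the (num_frames, num_levels) code grid."""
    offsets = np.arange(num_levels) * codebook_size
    return flat.reshape(-1, num_levels) - offsets


# Toy example: 2 frames, 2 quantizer levels, codebook of 10 entries per level.
toy_codes = np.array([[1, 2],
                      [3, 4]])
toy_flat = flatten_rvq(toy_codes, codebook_size=10)
print(toy_flat.tolist())  # [1, 12, 3, 14]
```

Under this layout the sequence length grows linearly with the number of quantizer levels, which is the cost a flattened model pays for using a single Transformer instead of separate temporal and depth-wise decoders.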

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Flattened (Llama-Mimi, ours) vs. hierarchical (CSM [csm2025sesame]) architectures. In Llama-Mimi, discrete audio tokens from Mimi [defossez2024moshispeechtextfoundationmodel] are flattened into a single sequence and modeled by a single Transformer decoder, whereas the hierarchical model separates temporal and depth-wise modeling across two Transformer decoders.