Continuous Audio Language Models

Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Défossez

TL;DR

CALM introduces Continuous Audio Language Models that operate in the continuous latent space of a VAE, bypassing lossy RVQ quantization. By coupling a noise-robust long-context Transformer with a lightweight short-context Transformer and a fast continuous consistency head, CALM achieves high-fidelity audio generation with substantial inference speedups over diffusion-based and discrete-token baselines. Key innovations include Gaussian temperature sampling, a head batch multiplier, latent CFG, and latent distillation, which together make 1-step sampling practical and rival or surpass the previous state of the art in speech and music tasks. The results show strong performance for speech continuation, TTS, music continuation, and text-to-music generation, and Pocket TTS, a 100M-parameter model, runs faster than real time on a laptop CPU.

Abstract

Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at hf.co/spaces/kyutai/calm-samples. Finally, we release Pocket TTS, an open-source 100M-parameter text-to-speech model that can run faster than real time on a laptop CPU: github.com/kyutai-labs/pocket-tts.

Paper Structure

This paper contains 42 sections, 10 equations, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: Overview of our model. During training, latent vectors $\mathbf{x}^s$ are noised to encourage the backbone Transformer to focus on coarse structure. The consistency head is a consistency model conditioned on the latent variable $\mathbf{z}_\text{long}^s$ produced by the backbone, as well as a short-term context vector $\mathbf{z}_\text{short}^s$ computed from a short-context Transformer applied to the most recent clean latent tokens.
  • Figure 2: Average pairwise speaker similarity over 100 unprompted 10s generations, with 100 10s examples from the ground-truth dataset as reference. As expected, with both methods the models generate more diverse speakers (i.e., lower pairwise speaker similarity) as the temperature increases.
  • Figure 3: Effect of the head batch multiplier value. Training a model (music consistency CALM) with a higher head batch multiplier speeds up convergence of the FAD metric. All evaluations use 4 consistency steps at inference time.
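
The head batch multiplier in Figure 3 can be illustrated with a small sketch. It assumes the multiplier works by reusing each backbone output for several independently noised copies of the same target latent, so that the lightweight head trains on a larger effective batch without extra backbone passes. That reading is an assumption, not a description confirmed by the text above. The function name `expand_for_head`, the uniform noise-level distribution, and the toy dimensions are all hypothetical.

```python
import random


def expand_for_head(z_batch, x_batch, multiplier, rng):
    """Repeat each (conditioning, target) pair `multiplier` times.

    z_batch: list of conditioning vectors, one per frame (stand-ins for
             backbone outputs; computing them is the expensive part).
    x_batch: list of clean target latents, one per frame.
    Each copy gets its own noise level and Gaussian noise vector, so the
    head sees `multiplier` different noisy views per backbone output.
    """
    z_rep, x_rep, levels, noises = [], [], [], []
    for z, x in zip(z_batch, x_batch):
        for _ in range(multiplier):
            z_rep.append(z)              # reused as-is, no recomputation
            x_rep.append(x)
            levels.append(rng.random())  # assumption: uniform in [0, 1)
            noises.append([rng.gauss(0.0, 1.0) for _ in x])
    return z_rep, x_rep, levels, noises


rng = random.Random(0)
# Toy data: 4 frames, 16-dim conditioning, 8-dim latents (hypothetical sizes).
z_batch = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
x_batch = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

z_rep, x_rep, levels, noises = expand_for_head(z_batch, x_batch, 8, rng)
print(len(z_rep), len(x_rep), len(levels), len(noises))
```

With a multiplier of 8, the four backbone outputs yield 32 head training examples. Under this assumption, the extra cost falls only on the small head.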