Continuous Audio Language Models
Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Défossez
TL;DR
CALM (Continuous Audio Language Models) operates in the continuous latent space of a VAE, bypassing lossy residual vector quantization (RVQ). By coupling a noise-robust long-context Transformer with a lightweight short-context Transformer and a fast continuous consistency head, CALM achieves high-fidelity audio generation with substantial inference speedups over diffusion-based and discrete-token baselines. Key innovations include Gaussian temperature sampling, a head batch multiplier, latent classifier-free guidance (CFG), and latent distillation. Together they make 1-step sampling practical, with quality that rivals or surpasses the previous state of the art on speech and music tasks. The results show strong performance on speech continuation, TTS, music continuation, and text-to-music generation, and Pocket TTS shows that synthesis can run faster than real time on a laptop CPU.
Abstract
Audio Language Models (ALMs) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at hf.co/spaces/kyutai/calm-samples. Finally, we release Pocket TTS, an open-source 100M-parameter text-to-speech model that can run faster than real time on a laptop CPU: github.com/kyutai-labs/pocket-tts.
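The generation loop described in the abstract can be illustrated with a toy sketch. This is not the released implementation. The dimensions, the random weights, the mean-pooling "backbone", and the function names (backbone, consistency_head, generate) are all hypothetical stand-ins. The way temperature is applied, by scaling the standard deviation of the Gaussian noise fed to the head, is an assumption about what Gaussian temperature sampling means here. The sketch only shows the control flow: a backbone summarizes past frames into a contextual embedding, and an MLP head maps that embedding plus noise to the next continuous latent frame in a single step.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8    # size of one continuous VAE frame (hypothetical)
CONTEXT_DIM = 16  # size of the backbone's contextual embedding (hypothetical)
HIDDEN_DIM = 32   # width of the MLP head (hypothetical)

# Random weights stand in for trained parameters.
W_backbone = rng.normal(scale=0.1, size=(LATENT_DIM, CONTEXT_DIM))
W1 = rng.normal(scale=0.1, size=(CONTEXT_DIM + LATENT_DIM, HIDDEN_DIM))
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, LATENT_DIM))


def backbone(frames):
    """Toy stand-in for the causal Transformer: pool past frames into one embedding."""
    return np.tanh(frames.mean(axis=0) @ W_backbone)


def consistency_head(context, noise):
    """Toy stand-in for the MLP head: map (context, noise) to the next frame in one step."""
    hidden = np.tanh(np.concatenate([context, noise]) @ W1)
    return hidden @ W2


def generate(prompt_frames, num_steps, temperature=1.0):
    """Autoregressively append num_steps latent frames to prompt_frames."""
    frames = list(prompt_frames)
    for _ in range(num_steps):
        context = backbone(np.stack(frames))
        # Assumed form of temperature: shrink the standard deviation of the noise.
        noise = np.sqrt(temperature) * rng.standard_normal(LATENT_DIM)
        frames.append(consistency_head(context, noise))
    return np.stack(frames)


prompt = np.zeros((3, LATENT_DIM))
latents = generate(prompt, num_steps=5, temperature=0.7)
print(latents.shape)
```

In a real system the resulting latent frames would be passed to the VAE decoder to obtain a waveform. That step is omitted here because the sketch has no trained decoder.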
