Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang
TL;DR
This work develops native, open discrete audio foundation models trained with next-token prediction (NTP) over interleaved semantic, acoustic, and text tokens, and introduces utterance-level interleaving to leverage transcripts without word-level alignment. It establishes the first scaling laws for discrete audio via IsoFLOP analysis across 64 models ($3\times 10^{18}$ to $3\times 10^{20}$ FLOPs), finding $D^* \propto C^{0.579}$ and $N^* \propto C^{0.367}$, where $C$ is training compute, $D^*$ is the compute-optimal number of training tokens, and $N^*$ is the compute-optimal model size. The optimal data budget therefore grows faster than the optimal model size, which is attributed to lower information density. The authors train SODA (Scaling Open Discrete Audio; 135M to 4B parameters) on 500B tokens (≈$1.3\times 10^{22}$ FLOPs), show that cold-start pretraining yields better stability and ASR performance than warm-start, and demonstrate cross-modal capabilities, including voice-preserving speech-to-speech translation, by fine-tuning SODA within the same NTP framework. Overall, the paper provides validated training recipes, scaling laws, and a flexible unified backbone for audio/text tasks that supports end-to-end audio generation and cross-modal understanding. It also releases checkpoints, data, and code to support further research.
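The relationship between the two fitted exponents can be checked with a few lines of arithmetic. The sketch below is illustrative only: the exponents are the ones reported above, while the helper name `growth_factor` and the choice to extrapolate across the study's 100x compute range are assumptions made for this example, not part of the paper's released code.

```python
# Exponents as reported in the TL;DR; all names below are hypothetical.
DATA_EXPONENT = 0.579   # D* ∝ C^0.579 (compute-optimal training tokens)
MODEL_EXPONENT = 0.367  # N* ∝ C^0.367 (compute-optimal model size)


def growth_factor(compute_multiplier: float, exponent: float) -> float:
    """Growth of a quantity X ∝ C^exponent when compute C is multiplied
    by compute_multiplier."""
    return compute_multiplier ** exponent


# Ratio of exponents: how much faster optimal data grows than optimal
# model size, measured in log space.
exponent_ratio = DATA_EXPONENT / MODEL_EXPONENT

# The IsoFLOP sweep spans 3e18 to 3e20 FLOPs, i.e. a 100x compute increase.
sweep_multiplier = 3e20 / 3e18
data_growth = growth_factor(sweep_multiplier, DATA_EXPONENT)
model_growth = growth_factor(sweep_multiplier, MODEL_EXPONENT)

if __name__ == "__main__":
    print(f"exponent ratio (data / model): {exponent_ratio:.2f}")
    print(f"optimal tokens grow by  x{data_growth:.1f} over the sweep")
    print(f"optimal params grow by  x{model_growth:.1f} over the sweep")
```

The exponent ratio comes out at about 1.58, which matches the abstract's statement that optimal data grows 1.6$\times$ faster than optimal model size. Under these power laws, a 100x increase in compute corresponds to roughly 14x more training tokens but only about 5x more parameters.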
Abstract
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
