Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang

TL;DR

This work develops native, open discrete audio foundation models trained with next-token prediction (NTP) over interleaved semantic, acoustic, and text tokens, introducing utterance-level interleaving to leverage transcripts without word-level alignment. It establishes the first scaling laws for discrete audio via IsoFLOP analysis across 64 models ($3\times 10^{18}$ to $3\times 10^{20}$ FLOPs), finding $D^* \propto C^{0.579}$ and $N^* \propto C^{0.367}$, where $C$ is the compute budget, $D^*$ the compute-optimal number of training tokens, and $N^*$ the compute-optimal model size. Optimal data therefore scales faster than optimal model size, which the work attributes to lower information density. The authors train SODA (135M–4B parameters) on 500B tokens (≈$1.3\times 10^{22}$ FLOPs), show that cold-start pretraining yields better stability and ASR performance than warm-start, and demonstrate cross-modal capabilities, including voice-preserving speech-to-speech translation, by fine-tuning SODA within the same NTP framework. Overall, the paper provides a validated training recipe, scaling laws, and a flexible unified backbone for audio/text tasks, enabling end-to-end audio generation and cross-modal understanding. It also releases checkpoints, data, and code to support further research on audio foundation models.
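To make the two exponents concrete, the following minimal sketch computes how the compute-optimal token count and model size would grow when the compute budget is multiplied by some factor. It uses only the exponents quoted above; the function and constant names are illustrative, and the proportionality constants are not given in this summary, so only relative growth is computed.

```python
# Exponents quoted in the TL;DR: D* ∝ C^0.579 and N* ∝ C^0.367.
DATA_EXPONENT = 0.579
MODEL_EXPONENT = 0.367


def growth_factors(compute_multiplier: float) -> tuple[float, float]:
    """Return (token_growth, parameter_growth) for a given compute multiplier.

    Only relative growth is computed, because the proportionality
    constants of the two power laws are not stated in this summary.
    """
    return (
        compute_multiplier ** DATA_EXPONENT,
        compute_multiplier ** MODEL_EXPONENT,
    )


# Ratio of the exponents: how much faster optimal data grows than
# optimal model size, measured in log space.
exponent_ratio = DATA_EXPONENT / MODEL_EXPONENT

if __name__ == "__main__":
    print(f"exponent ratio: {exponent_ratio:.2f}")
    for multiplier in (10, 100):
        tokens, params = growth_factors(multiplier)
        print(f"{multiplier}x compute -> {tokens:.1f}x tokens, {params:.1f}x parameters")
```

Under these exponents, a 10x increase in compute corresponds to roughly 3.8x more tokens but only about 2.3x more parameters, and the exponent ratio of about 1.58 matches the "1.6 times faster" figure in the abstract.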

Abstract

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
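As a rough sanity check on the training scale, the sketch below applies the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule is an assumption here, not the paper's stated accounting, and the parameter counts are the nominal sizes from the abstract rather than exact values.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token.

    This is a standard approximation, assumed here for illustration;
    the paper's own FLOP accounting may differ.
    """
    return 6.0 * n_params * n_tokens


TOKENS = 500e9  # 500B training tokens, from the abstract

# Nominal smallest and largest SODA sizes (exact counts are not given here).
NOMINAL_SIZES = {"135M": 135e6, "4B": 4e9}

estimates = {
    name: approx_training_flops(n_params, TOKENS)
    for name, n_params in NOMINAL_SIZES.items()
}

if __name__ == "__main__":
    for name, flops in estimates.items():
        print(f"SODA {name}: ~{flops:.2e} training FLOPs")
```

For the nominal 4B model this gives about $1.2\times 10^{22}$ FLOPs, the same order of magnitude as the ≈$1.3\times 10^{22}$ figure in the TL;DR; the gap could come from the exact parameter count or a different FLOP-counting convention.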

Paper Structure (44 sections, 5 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Three token types examined in §\ref{sec:token_composition}: (a) Semantic-only, (b) Semantic+Acoustic, and (c) Utterance-level interleaved Semantic+Acoustic+Text, where (c.1) shows text-first format, and (c.2) shows audio-first format. Superscript $i$ denotes utterance index, where utterance($i$) and utterance($i$+1) are adjacent audio segments from the same document. See Appendix \ref{sec:appendix_formatting} for detailed data formatting.
  • Figure 2: Impact of adding Nemotron on NLL for audio and text validation data. Full results on other metrics are shown in Figure \ref{fig:nemotron_sweep_appendix}.
  • Figure 3: Validation NLL (audio+text) versus downstream task performance. Circular points: 64 IsoFLOP models (§\ref{sec:scaling}); star-shaped points: final SODA runs (§\ref{sec:largescale}). Regression lines are fitted on 64 IsoFLOP models only. Full results with other metrics are shown in Appendix \ref{sec:appendix_nll}.
  • Figure 4: IsoFLOP analysis for discrete audio modeling. (a) and (b) show the loss landscape across model sizes and token counts for each compute budget. (c) shows the fitted scaling laws with extrapolation, revealing that optimal data $D^*$ scales faster than optimal model size.
  • Figure 5: Full results of text-injection ratio sweep (0%--90% Nemotron text data). Colors indicate the trend as text percentage increases: Red (Cross-Modal skills: ASR, TTS) shows degradation in performance as text ratio increases beyond 5%. Orange (Semantic and Acoustic understanding: sWUGGY, sBLIMP, Salmon) indicates little variation across text ratios, suggesting these capabilities are robust to text inclusion. Green (Text Knowledge tasks: tWUGGY, tBLIMP, HellaSwag) shows monotonic improvement as text percentage increases, with notable gains from 0% to 2.5%.
  • ...and 6 more figures
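
The IsoFLOP analysis summarized in the Figure 4 caption can be illustrated with a toy sketch. It assumes the common parabola-fit variant: for each compute budget, fit a quadratic to loss versus log model size, take its minimum as the optimal size, then fit a power law across budgets. The paper's exact fitting procedure is not described in this section, all function names are hypothetical, and the data below is synthetic, constructed so that the true exponent is 0.4.

```python
import math


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(a[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (a[i][3] - tail) / a[i][i]
    return x


def isoflop_optimum(sizes, losses):
    """Least-squares fit loss = c0 + c1*x + c2*x^2 with x = centered log(size);
    return the model size at the parabola's minimum."""
    logs = [math.log(n) for n in sizes]
    mean = sum(logs) / len(logs)
    xs = [v - mean for v in logs]  # centering improves numerical stability
    s = [sum(x ** k for x in xs) for k in range(5)]  # power sums S0..S4
    t = [sum(y * x ** k for x, y in zip(xs, losses)) for k in range(3)]
    normal = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    _, c1, c2 = solve3(normal, t)
    if c2 <= 0:
        raise ValueError("fitted curve has no minimum")
    return math.exp(mean - c1 / (2.0 * c2))


def power_law_exponent(compute, optima):
    """Slope of log(optimum) versus log(compute), by ordinary least squares."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(o) for o in optima]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic toy data (NOT from the paper): each budget has a parabolic loss
# curve in log(size) whose minimum sits at N* = C^0.4 by construction.
budgets = [1e18, 1e19, 1e20]
optima = []
for c in budgets:
    true_opt = c ** 0.4
    sizes = [true_opt * m for m in (0.5, 1.0, 2.0, 4.0, 8.0)]
    losses = [2.0 + 0.05 * (math.log(n) - math.log(true_opt)) ** 2 for n in sizes]
    optima.append(isoflop_optimum(sizes, losses))

alpha = power_law_exponent(budgets, optima)
```

On this noiseless toy data the recovered exponent `alpha` equals the constructed value of 0.4 up to floating-point error; real IsoFLOP curves are noisy, so the fit quality would depend on how many model sizes are trained per budget.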