Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj

TL;DR

This work tackles the inefficiency of autoregressive audio generation caused by long sequence lengths by introducing Scale-level Audio Tokenizer (SAT) and Acoustic AutoRegressive (AAR) modeling. SAT compresses audio into multi-scale tokens via MSRQ, and AAR performs next-scale prediction to reduce autoregressive steps, formalized as $p(r_i|r_1,...,r_{i-1})$ with a scale-aware attention mask, achieving roughly $35\times$ faster inference and lower Fréchet Audio Distance on AudioSet compared to baselines. Two-stage training (SAT followed by AAR) yields improved reconstruction with fewer tokens (e.g., 455 vs 750) and faster generation, while ablations show the critical roles of scale scheduling and discriminators. This approach offers a practical and scalable path to efficient, high-fidelity audio synthesis, potentially enabling real-time or on-device AR generation and integration with multimodal systems.
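The scale-aware attention mask mentioned above can be illustrated with a minimal sketch. It assumes a block-wise causal pattern in which every token may attend to the start token, to all tokens of earlier scales, and to all tokens of its own scale. The function name `scale_causal_mask` and the example scale lengths are hypothetical and not taken from the paper's code.

```python
import numpy as np

def scale_causal_mask(scale_lengths):
    """Return a boolean mask M where M[q, k] is True if query q may attend to key k.

    Assumption: position 0 is a single start token, followed by the tokens of
    scale 1, scale 2, ... laid out contiguously. A token may attend to the
    start token and to every token whose scale index is <= its own.
    """
    lengths = [1] + list(scale_lengths)          # prepend the start token as its own block
    # Block index of every position, e.g. [1, 1, 2, 3] -> [0, 1, 2, 2, 3, 3, 3]
    block_id = np.repeat(np.arange(len(lengths)), lengths)
    return block_id[:, None] >= block_id[None, :]

# Hypothetical example: three scales with 1, 2, and 3 tokens.
mask = scale_causal_mask([1, 2, 3])
print(mask.astype(int))
```

Because all tokens of one scale share the same row pattern, they can be predicted in a single forward pass, so the number of autoregressive steps equals the number of scales rather than the number of tokens.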

Abstract

Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT), with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts the next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate the proposed AAR framework achieves a remarkable $35\times$ faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: https://github.com/qiuk2/AAR.

Paper Structure

This paper contains 34 sections, 13 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: Autoregressive modeling of audio. (a) Next-token prediction: sequential token generation in chronological order (left to right), which aligns with the natural temporal structure of audio; (b) Next-scale prediction: multi-scale token maps are autoregressively generated from coarse to fine scales (lower to higher resolutions). Tokens are generated in parallel within each scale, which reduces the number of AR prediction iterations by about 40x.
  • Figure 2: Our model involves two distinct training phases. Stage 1: the Scale-level Audio Tokenizer (SAT) encodes an audio sample into a series of $K$ token scales, denoted as $\mathcal{R} = (r_1, r_2, \dots, r_K)$. Each scale encodes information in different frequencies of the audio waveform. Stage 2: Acoustic AutoRegressive (AAR) modeling via next-scale prediction relies on the pre-trained SAT to predict each scale-level token $r_i$ by conditioning on all previously predicted scales $r_{<i}$ and a CLAP token (Wu et al., 2023) as the start token. The CLAP token is derived from ground-truth audio. During training, we use the standard cross-entropy loss and the attention mask shown above to ensure that each $r_i$ can attend only to $r_{\leq i}$ and the start token.
  • Figure 3: Performance of the autoregressive models when the classifier-free guidance scale is 10. next-token: AR via next-token prediction; next-scale: our AAR.
  • Figure 4: Performance of AAR at different classifier-free guidance scales from 2 to 18 (left to right), with each point incremented by 2. The red line represents Fréchet Audio Distance (FAD) vs. Inception Score (ISc), while the blue line represents Kullback-Leibler divergence (KL) vs. Inception Score (ISc).
  • Figure 5: Visualization of Linear, Quadratic, and Logarithmic scale scheduling across the range from 1 to 75.
  • ...and 2 more figures
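
The three scale schedules named in Figure 5 can be sketched as follows. The exact formulas used in the paper are not given in this section, so the interpolation rules below (and the names `make_schedule`, `num_scales`, and `max_len`) are assumptions chosen only to show how a schedule running from 1 to 75 determines the per-scale token counts and the total sequence length.

```python
import numpy as np

def make_schedule(kind, num_scales, max_len=75):
    """Return a list of per-scale token lengths growing from 1 to max_len.

    Assumed (illustrative) definitions, with t running evenly from 0 to 1:
      linear:      1 + (max_len - 1) * t
      quadratic:   1 + (max_len - 1) * t**2
      logarithmic: 1 + (max_len - 1) * log(1 + 9 t) / log(10)
    """
    t = np.linspace(0.0, 1.0, num_scales)
    if kind == "linear":
        shape = t
    elif kind == "quadratic":
        shape = t ** 2
    elif kind == "logarithmic":
        shape = np.log1p(9.0 * t) / np.log(10.0)
    else:
        raise ValueError(f"unknown schedule: {kind}")
    lengths = np.rint(1 + (max_len - 1) * shape).astype(int)
    return lengths.tolist()

# Hypothetical setting: 8 scales. AR steps = number of scales;
# total tokens = sum of the per-scale lengths.
for kind in ("linear", "quadratic", "logarithmic"):
    lengths = make_schedule(kind, num_scales=8)
    print(kind, lengths, "total tokens:", sum(lengths), "AR steps:", len(lengths))
```

Under these assumed formulas, the quadratic schedule keeps early scales short and yields the fewest total tokens, while the logarithmic schedule grows quickly at first and yields the most.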