UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
TL;DR
UniAudio 2.0 tackles the core challenges of building a unified audio-language foundation model by introducing ReasoningCodec, a two-stream discrete audio tokenizer that separates text-aligned reasoning tokens from reconstruction tokens, enabling both understanding and high-fidelity generation. The model employs a functionally specialized autoregressive architecture with lower layers for understanding, middle cross-modal layers initialized from an LLM, and upper layers for generation, trained via a four-stage, multi-task paradigm on 100B text tokens and 60B audio tokens. Empirical results show competitive performance on seen tasks and strong few-shot and zero-shot generalization across speech, sound, and music, supported by extensive ablations on data, architecture, and training. The work demonstrates the potential of explicit tokenization and layer specialization to scale unified audio-language modeling, with practical implications for cross-domain audio capabilities and future scaling directions.
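As a rough illustration of the layer specialization described above, the following minimal sketch partitions a decoder stack into lower, middle, and upper groups. The function name `assign_layer_roles`, the role labels, and the layer counts are hypothetical choices for illustration only; this summary does not state how many layers each group contains.

```python
from typing import List


def assign_layer_roles(num_layers: int, num_lower: int, num_upper: int) -> List[str]:
    """Label each layer of a stack by its assumed functional role.

    Hypothetical helper: the 12/3/3 split used below is illustrative
    and is not taken from the paper.
    """
    if num_lower < 0 or num_upper < 0 or num_lower + num_upper > num_layers:
        raise ValueError("lower and upper groups must fit inside the stack")
    num_middle = num_layers - num_lower - num_upper
    return (
        ["understanding"] * num_lower    # lower layers: audio understanding
        + ["cross_modal"] * num_middle   # middle layers: initialized from an LLM
        + ["generation"] * num_upper     # upper layers: audio generation
    )


# Illustrative configuration only.
roles = assign_layer_roles(num_layers=12, num_lower=3, num_upper=3)
```

In a real model, such a role list might decide which layers load pretrained LLM weights and which are trained from scratch; that use is an assumption, not something this summary specifies.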
Abstract
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
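To make the two-stream factorization more concrete, here is a minimal sketch of how reasoning and reconstruction tokens could be laid out in a single autoregressive sequence that shares an ID space with text. Everything here is an assumption for illustration: the class `FactorizedAudioTokens`, the function `to_autoregressive_sequence`, the reasoning-first ordering, the ID-offset scheme, and the vocabulary sizes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactorizedAudioTokens:
    """Hypothetical container for the two token streams of one audio clip."""
    reasoning: List[int]        # text-aligned, high-level tokens
    reconstruction: List[int]   # acoustic tokens for waveform reconstruction


def to_autoregressive_sequence(
    tokens: FactorizedAudioTokens,
    text_vocab_size: int,
    reasoning_vocab_size: int,
) -> List[int]:
    """Flatten both streams into one ID sequence (assumed layout).

    Assumption: reasoning tokens come first so that reconstruction tokens
    can be predicted conditioned on them, and each stream is shifted into
    its own ID range after the text vocabulary.
    """
    reasoning_offset = text_vocab_size
    reconstruction_offset = text_vocab_size + reasoning_vocab_size
    sequence = [reasoning_offset + t for t in tokens.reasoning]
    sequence += [reconstruction_offset + t for t in tokens.reconstruction]
    return sequence


# Toy values chosen only to keep the arithmetic easy to follow.
example = FactorizedAudioTokens(reasoning=[0, 2], reconstruction=[5, 1, 7])
sequence = to_autoregressive_sequence(
    example, text_vocab_size=100, reasoning_vocab_size=10
)
# sequence == [100, 102, 115, 111, 117]
```

Under this assumed layout, an understanding task would only need to read the leading reasoning IDs, while a generation task would continue on to the reconstruction IDs before decoding a waveform.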
