UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens
Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Yinghao Liu, Zheng Xue, Gang Song, Boyang Zhou
TL;DR
UniTok-Audio is a unified, decoder-only autoregressive (AR) framework for time-aligned audio tasks that models target audio as discrete tokens produced by a dual-stream H-Codec. Continuous conditioning features from text and from audio self-supervised learning (SSL) models, together with a task token, let a single LM backbone cover five operational modes: speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). This enables high-fidelity reconstruction with relatively modest model sizes. The H-Codec uses separate acoustic and semantic codebooks and a four-layer RVQ backbone to achieve high-quality waveform reconstruction at a low frame rate, while the AR token prediction uses a delay pattern to balance performance and efficiency. In experiments on speech, music, and general audio, UniTok-Audio performs competitively with state-of-the-art task-specific and multi-task baselines, suggesting its potential as a foundation model for unified AR audio generation. The work also provides data-simulation pipelines and plans an open-source release.
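The summary above mentions a delay pattern for AR token prediction but does not spell out the exact scheme. The following is a minimal sketch that assumes the common variant in which RVQ layer k is shifted right by k steps, so that each AR step emits one token per layer. The function names, the `PAD` id, and the use of four layers are illustrative assumptions, not the authors' implementation.

```python
PAD = -1  # hypothetical padding id, not taken from the paper


def apply_delay_pattern(codes, pad=PAD):
    """Shift layer k right by k steps (assumed one-step-per-layer delay).

    codes: list of K lists, each holding T token ids (one list per RVQ layer).
    Returns K lists of length T + K - 1, padded where no token is available.
    """
    num_layers = len(codes)
    num_frames = len(codes[0])
    delayed = []
    for k, layer in enumerate(codes):
        if len(layer) != num_frames:
            raise ValueError("all layers must have the same number of frames")
        # k pads in front, (K - 1 - k) pads at the end keep every row equal length
        delayed.append([pad] * k + list(layer) + [pad] * (num_layers - 1 - k))
    return delayed


def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern and recover the time-aligned K x T grid."""
    num_layers = len(delayed)
    num_frames = len(delayed[0]) - (num_layers - 1)
    return [layer[k:k + num_frames] for k, layer in enumerate(delayed)]


# Toy example: 4 layers, 3 frames
codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)
```

Under this assumption, generating T frames takes T + K - 1 AR steps rather than the K x T steps of a fully flattened sequence, which is one way such a pattern can trade a small amount of modeling fidelity for efficiency.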
Abstract
Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in audio quality and in generalization across tasks, and are typically developed separately for each task. This fragmentation results in redundant development effort, inconsistent performance, and limited extensibility. To address these issues, we propose \textbf{UniTok-Audio}, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features from the conditions and generates discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies the different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec with acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance in comparison with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.
