UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens

Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Yinghao Liu, Zheng Xue, Gang Song, Boyang Zhou

TL;DR

UniTok-Audio is a unified, decoder-only autoregressive (AR) framework for time-aligned audio tasks that models the target audio as discrete tokens produced by a dual-stream codec, H-Codec. Continuous features extracted from text and from audio self-supervised learning (SSL) models serve as conditions, and a task token selects among five operating modes within a single LM backbone: speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). H-Codec uses separate acoustic and semantic codebooks with a four-layer RVQ backbone to reconstruct high-quality waveforms at a low frame rate, while AR token prediction uses a delay pattern to balance performance and efficiency. In experiments on speech, music, and general audio, UniTok-Audio performs competitively with state-of-the-art task-specific and multi-task baselines at relatively modest model sizes, which suggests it could serve as a foundation model for unified AR audio generation. The work also describes data-simulation pipelines and plans an open-source release.
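To make the "RVQ backbone" mentioned above concrete, here is a minimal sketch of generic residual vector quantization: each layer quantizes whatever the previous layers left unexplained. This is not H-Codec itself. The codebook values, the two-layer depth (the paper uses four layers), and the function names `nearest` and `rvq_encode` are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x layer by layer; each layer encodes the remaining residual."""
    residual = list(x)
    recon = [0.0] * len(x)
    indices: List[int] = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Add the chosen entry to the reconstruction, remove it from the residual.
        recon = [r + c for r, c in zip(recon, codebook[i])]
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, recon


# Toy, hand-written codebooks (hypothetical values, two layers only).
codebooks = [
    [[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]],      # coarse layer
    [[0.0, 0.0], [0.25, 0.5], [-0.25, -0.5]],    # refinement layer
]
indices, recon = rvq_encode([1.3, -0.6], codebooks)
print(indices, recon)
```

Each frame of audio is thus represented by one index per layer, which is why the language model has to predict several token streams per frame.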

Abstract

Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features of the conditions to generate discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies the different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec with acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance in comparison with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.

Paper Structure

This paper contains 29 sections, 4 equations, 2 figures, 12 tables.

Figures (2)

  • Figure 1: The overall architecture of UniTok-Audio, a straightforward model for multiple audio tasks. For simplicity, the AR process is illustrated with single-layer codec tokens; in practice it operates on multi-layer codec tokens with a delay pattern.
  • Figure 2: The framework of the proposed H-Codec.
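
The delay pattern mentioned in the Figure 1 caption can be sketched as follows. This shows one commonly used scheme, in which codec layer k is shifted right by k steps so that the model emits one token per layer at every AR step; the exact variant used by UniTok-Audio is not given in this summary, so the shift rule, the `PAD` placeholder, and the function names are assumptions.

```python
from typing import List

PAD = -1  # hypothetical placeholder id, not taken from the paper


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift layer k right by k steps so all layers share one time axis.

    codes[k][t] is the token of codec layer k at frame t (K layers, T frames).
    The result has K rows of length T + K - 1.
    """
    num_layers = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_layers - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay, recovering the frame-aligned K x T token grid."""
    num_layers = len(delayed)
    num_frames = len(delayed[0]) - num_layers + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Four layers (matching the four-layer RVQ) and three frames of dummy tokens.
codes = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
```

With this layout, a sequence of T frames and K layers needs T + K - 1 AR steps rather than the K × T steps required when all layers are flattened into one stream, while each layer's token at frame t is still predicted after the layer below it at the same frame.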