DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation

Rui Lin, Zhiyue Wu, Jiahe Le, Kangdi Wang, Weixiong Chen, Junyu Dai, Tao Jiang

TL;DR

This work tackles the core tension between high-fidelity audio reconstruction and language-model (LM) learnability in vocal–accompaniment generation. It introduces Duo-Tok, a four-stage, SSL-centered tokenizer with source-aware dual codebooks for vocals and accompaniment, augmented by Gaussian replacement noise, multi-task supervision, and a latent diffusion decoder. At an ultra-low bitrate of $0.75$ kbps, Duo-Tok achieves the best music-tagging performance and the lowest vocabulary-normalized LM perplexity among the compared baselines, while maintaining reconstruction quality comparable to state-of-the-art tokenizers. The results indicate that semantically decoupled dual-track codes, combined with stage-wise optimization, can yield LM-friendly representations without sacrificing fidelity, enabling better dual-track language modeling and controllable vocal–instrumental generation. The paper also discusses limitations, such as vocal–instrumental asymmetry and the dependence on separation quality, and suggests future work on MIDI-symbol alignment and joint symbolic-audio modeling.

Abstract

Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
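
To make the quoted bitrate concrete, the sketch below computes the bitrate of a discrete token stream as streams × tokens per second × bits per token. The function name `token_bitrate_kbps` and the numbers in the example (25 tokens per second, a 32768-entry codebook, two tracks) are illustrative assumptions, not values taken from this section; they are only one of many combinations that land on 0.75 kbps, and this section does not state whether that figure counts one track or both.

```python
import math


def token_bitrate_kbps(frame_rate_hz: float, codebook_size: int, num_streams: int = 1) -> float:
    """Bitrate in kbps of parallel token streams with a fixed-size codebook.

    Each token carries log2(codebook_size) bits, so the total is
    num_streams * frame_rate_hz * log2(codebook_size) bits per second.
    """
    bits_per_token = math.log2(codebook_size)
    return num_streams * frame_rate_hz * bits_per_token / 1000.0


# Hypothetical configuration, chosen for illustration only (not from the paper).
per_track = token_bitrate_kbps(frame_rate_hz=25.0, codebook_size=32768)
both_tracks = token_bitrate_kbps(frame_rate_hz=25.0, codebook_size=32768, num_streams=2)
print(per_track, both_tracks)
```

The same function shows why low-bitrate tokenizers are attractive for language modeling: halving the frame rate halves the sequence length the LM must model, while doubling the codebook size adds only one bit per token.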

Paper Structure

This paper contains 42 sections, 14 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Reconstruction–Generation trade-off visualized with Codec-Evaluation. Each bubble corresponds to a codec: the x-axis is PPL@1024, the y-axis is log-Mel L1 distance, and bubble size encodes bitrate. Existing codecs form an approximate Pareto frontier. Duo-Tok is designed to shift this frontier toward jointly lower perplexity and competitive reconstruction quality at very low bitrate.
  • Figure 2: Stage-1 BEST-RQ-style SSL pretraining. A Transformer encoder operates on log-Mel spectrograms with a masked-prediction objective.
  • Figure 3: Stage-2 multi-task fine-tuning with Gaussian replacement at the bottleneck. Downstream auxiliary heads (lyrics ASR, Mel, chroma, and MSS-mask) act as semantic guardrails on top of the noise-injected encoder representation.
  • Figure 4: Stage-3 dual-codebook SimVQ with hard routing. Vocal and instrumental stems are routed to separate branches with independent codebooks, while the encoder is kept frozen.
  • Figure 5: Language-model evaluation protocols. We train (i) an unconditional dual-track LM over vocal and instrumental tokens, and (ii) a vocal-conditioned LM that predicts accompaniment given vocal tokens and an accompaniment prefix.
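
The mechanisms named in the Figure 3 and Figure 4 captions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `gaussian_replace` assumes that "Gaussian replacement" means swapping each bottleneck frame for i.i.d. Gaussian noise with some probability `p`, and `quantize_routed` uses a plain nearest-neighbor lookup in place of SimVQ, whose specific codebook parameterization is not reproduced here. All function names, the probability `p`, and the toy codebooks are hypothetical.

```python
import random
from typing import Dict, List, Sequence

Vector = List[float]


def gaussian_replace(frames: Sequence[Vector], p: float, rng: random.Random,
                     sigma: float = 1.0) -> List[Vector]:
    """Replace each frame with Gaussian noise with probability p (assumed form)."""
    out: List[Vector] = []
    for frame in frames:
        if rng.random() < p:
            # Whole frame is swapped for noise of the same dimensionality.
            out.append([rng.gauss(0.0, sigma) for _ in frame])
        else:
            out.append(list(frame))
    return out


def nearest_code(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to frame in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def quantize_routed(frames: Sequence[Vector], source: str,
                    codebooks: Dict[str, Sequence[Vector]]) -> List[int]:
    """Hard routing: the stem label alone selects which codebook is searched."""
    codebook = codebooks[source]
    return [nearest_code(frame, codebook) for frame in frames]


# Toy 2-D codebooks; real codebooks would be learned on frozen encoder outputs.
codebooks = {
    "vocal": [[0.0, 0.0], [1.0, 1.0]],
    "accompaniment": [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]],
}
frames = [[0.9, 0.8], [0.2, 0.9]]
vocal_tokens = quantize_routed(frames, "vocal", codebooks)
accompaniment_tokens = quantize_routed(frames, "accompaniment", codebooks)
noisy = gaussian_replace(frames, p=0.5, rng=random.Random(0))
```

Because routing is decided by the stem label rather than learned, the two codebooks never compete for the same frames, which is one plausible reading of why the captions describe the branches as independent.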