DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation
Rui Lin, Zhiyue Wu, Jiahe Le, Kangdi Wang, Weixiong Chen, Junyu Dai, Tao Jiang
TL;DR
This work tackles the tension between high-fidelity audio reconstruction and language-model (LM) learnability in vocal–accompaniment generation. It introduces Duo-Tok, a four-stage tokenizer centered on self-supervised learning (SSL), with source-aware dual codebooks for vocals and accompaniment, augmented by Gaussian replacement noise, multi-task supervision, and a latent diffusion decoder. At an ultra-low bitrate of 0.75 kbps, Duo-Tok achieves the best music-tagging performance and the lowest vocabulary-normalized LM perplexity among the compared baselines, while maintaining reconstruction quality comparable to state-of-the-art tokenizers. The results demonstrate that semantically decoupled dual-track codes, combined with stage-wise optimization, can yield LM-friendly representations without sacrificing fidelity, enabling better dual-track language modeling and controllable vocal–instrumental generation. The paper also discusses limitations, such as vocal–instrumental asymmetry and the role of separation quality, and suggests future work on MIDI-symbol alignment and joint symbolic-audio modeling.
Abstract
Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction, producing acoustic tokens that are difficult to model, or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of the dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline. We first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision. We then freeze the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally train latent diffusion decoders on top of the discrete tokens. At 0.75 kbps, Duo-Tok shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging average precision (AP) and the lowest vocabulary-normalized LM perplexity among the compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
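
To make the pipeline vocabulary concrete, the following is a minimal NumPy sketch of three ideas named above: Gaussian replacement noise, hard-routed dual codebooks, and the bitrate arithmetic. It is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. Replacement noise is read here as swapping a random subset of frames for Gaussian samples, and the SimVQ-style codebook is assumed to take the form of a fixed basis multiplied by a learned linear map. The frame rate, codebook size, and track count in the usage example are illustrative values chosen only because they multiply out to 750 bit/s; they are not taken from the text.

```python
import math
import numpy as np


def gaussian_replace(frames, p, rng):
    """Replace each frame (row) with probability p by a Gaussian sample.

    Assumption: the noise is matched to the global mean and standard
    deviation of the input; the actual recipe may differ.
    """
    mask = rng.random(frames.shape[0]) < p
    noise = rng.standard_normal(frames.shape) * frames.std() + frames.mean()
    return np.where(mask[:, None], noise, frames), mask


def effective_codebook(basis, weight):
    """SimVQ-style reparameterization (assumed form): codes = basis @ weight."""
    return basis @ weight


def quantize(frames, codebook):
    """Return the index of the nearest code (squared L2) for each frame."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def route_and_quantize(frames, source, codebooks):
    """Hard routing: the source label alone selects the codebook."""
    return quantize(frames, codebooks[source])


def bitrate_bps(frame_rate_hz, codebook_size, n_tracks):
    """Bits per second for one token per frame per track."""
    return frame_rate_hz * math.log2(codebook_size) * n_tracks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_codes = 8, 16
    # One codebook per source; sizes here are toy values.
    codebooks = {
        name: effective_codebook(rng.standard_normal((n_codes, dim)),
                                 rng.standard_normal((dim, dim)))
        for name in ("vocal", "accompaniment")
    }
    vocal_frames, _ = gaussian_replace(rng.standard_normal((5, dim)), 0.2, rng)
    print(route_and_quantize(vocal_frames, "vocal", codebooks))
    # Hypothetical configuration: 25 frames/s, 32768 codes, 2 tracks.
    print(bitrate_bps(25, 32768, 2))
```

Under hard routing the vocal and accompaniment streams never compete for the same codes, so each codebook can specialize on its own source, and the token budget is simply the sum over the two tracks.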
