DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
TL;DR
This work tackles data scarcity and limited model scalability in diffusion-based Singing Voice Synthesis (SVS). It introduces a two-stage data construction pipeline: melodies are held fixed while lyrics are varied with LLMs, and the resulting recordings are used to train melody-specific PseudoSingers that synthesize over 500 hours of high-quality Chinese singing. Building on this corpus, it presents DiTSinger, a Diffusion Transformer with RoPE and qk-norm, scaled in depth, width, and resolution to improve fidelity. A key contribution is an implicit alignment mechanism that constrains phoneme-to-acoustic attention within character-level spans, which removes the need for phoneme-level duration labels and improves robustness under noisy or uncertain alignments. Experiments demonstrate scalable, alignment-free, high-fidelity SVS, with DiTSinger outperforming state-of-the-art methods on objective metrics and MOS on a substantial Chinese singing dataset.
Abstract
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
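The idea of constraining phoneme-to-acoustic attention within character-level spans can be illustrated with a small attention mask. The sketch below is not the paper's implementation; it is a minimal illustration under stated assumptions: each character is assumed to have a known half-open frame span `[start, end)`, each phoneme is assumed to map to exactly one character, and the names `build_span_mask`, `masked_softmax`, `char_spans`, and `phoneme_to_char` are hypothetical. Within a character's span, every frame may attend to any of that character's phonemes, so no per-phoneme duration is ever specified.

```python
import math

def build_span_mask(char_spans, phoneme_to_char, num_frames):
    """Return mask[t][p], True when frame t lies inside the span of phoneme p's character."""
    mask = [[False] * len(phoneme_to_char) for _ in range(num_frames)]
    for p, c in enumerate(phoneme_to_char):
        start, end = char_spans[c]  # assumed half-open span [start, end) in frames
        for t in range(max(start, 0), min(end, num_frames)):
            mask[t][p] = True
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, restricted to allowed positions."""
    kept = [s for s, a in zip(scores, allowed) if a]
    if not kept:
        # No phoneme is allowed for this frame (e.g. silence): return zero weights.
        return [0.0] * len(scores)
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy input: 2 characters, 4 phonemes (2 per character), 6 acoustic frames.
char_spans = [(0, 3), (3, 6)]
phoneme_to_char = [0, 0, 1, 1]
mask = build_span_mask(char_spans, phoneme_to_char, num_frames=6)

# Attention weights of frame 0 over the 4 phonemes, given arbitrary example scores.
weights = masked_softmax([0.5, 1.5, 2.0, -1.0], mask[0])
```

In this toy case, frame 0 belongs to the first character, so its attention mass is split only between phonemes 0 and 1 and is exactly zero for phonemes 2 and 3. How that mass is divided between the two allowed phonemes is left to the learned scores, which is the sense in which the alignment is implicit.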
