DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu

TL;DR

This work tackles data scarcity and scalability in diffusion-based singing voice synthesis by introducing a two-stage data construction pipeline that fixes melodies and varies lyrics via LLMs to train melody-specific PseudoSingers, enabling synthesis of over 500 hours of high-quality Chinese singing. Building on this corpus, it presents DiTSinger, a Diffusion Transformer with RoPE and qk-norm, scaled in depth, width, and resolution to improve fidelity. A key contribution is an implicit alignment mechanism that constrains phoneme-to-acoustic attention within character spans, removing the need for phoneme-duration labels while enhancing robustness. Experiments demonstrate scalable, alignment-free, high-fidelity SVS, with DiTSinger outperforming state-of-the-art methods across objective metrics and MOS on a substantial Chinese singing dataset.

Abstract

Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
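One way to picture the implicit alignment idea is as a cross-attention mask: each acoustic frame may attend only to the phonemes of the character whose span covers that frame, so no phoneme-level durations are required. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `span_attention_mask`, `masked_softmax`, `phoneme_to_char`, and `char_spans` are hypothetical, and the sketch assumes character-level spans are already available as half-open frame ranges that together cover every frame.

```python
import math

def span_attention_mask(phoneme_to_char, char_spans, num_frames):
    """Build mask[f][p]: True if frame f may attend to phoneme p.

    phoneme_to_char[p] is the index of the character owning phoneme p.
    char_spans[c] is a half-open (start, end) frame range for character c.
    Both inputs are assumptions made for this illustration.
    """
    mask = [[False] * len(phoneme_to_char) for _ in range(num_frames)]
    for p, c in enumerate(phoneme_to_char):
        start, end = char_spans[c]
        for f in range(max(start, 0), min(end, num_frames)):
            mask[f][p] = True
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one frame's attention scores, restricted to allowed phonemes.

    Assumes at least one entry of `allowed` is True.
    """
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Two characters, two phonemes each; character 0 spans frames 0-2, character 1 spans frames 3-4.
mask = span_attention_mask([0, 0, 1, 1], [(0, 3), (3, 5)], num_frames=5)
# Frame 0 lies in character 0's span, so the high scores of phonemes 2 and 3 are ignored.
weights = masked_softmax([1.0, 1.0, 5.0, 5.0], mask[0])
```

Within a span the model remains free to distribute attention across that character's phonemes, which is where the phoneme-level timing is learned implicitly rather than supplied as a label.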

Paper Structure

This paper contains 11 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the proposed two-stage data construction pipeline. The Recording-fitting Phase (left) collects high-quality vocal recordings without accompaniment from professional singers to train a melody-specific model, PseudoSinger. The Data Expansion Phase (right) leverages the trained PseudoSinger to synthesize large-scale singing data with diverse LLM-generated lyrics while keeping the melody fixed. This enables scalable dataset construction with improved phonetic consistency and melodic alignment.
  • Figure 2: DiTSinger Training Phase. The model predicts the added noise $\boldsymbol{\epsilon}$ to the noisy mel-spectrogram tokens at each denoising step $t$, conditioned on both fine-grained (e.g., music scores, lyrics) and coarse-grained (e.g., timbre, timestep) inputs. Right: detailed structure of a single DiTBlock, which integrates Multi-Head Self-Attention with RoPE and QK-Norm, Multi-Head Cross-Attention with QK-Norm, and Adaptive Layer Normalization modulated by learnable parameters $\{\gamma_i, \beta_i\}$ and residual scaling factors $\{\alpha_i\}$.
  • Figure 3: Scaling results of DiTSinger. (a) Architectural scaling improves MCD. (b) Data scaling further boosts performance. $S_2$ denotes a Small model with half resolution. GFLOPs are measured on 5 s of audio.
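The Figure 2 caption describes each DiTBlock sublayer as wrapped in Adaptive Layer Normalization with parameters $\{\gamma_i, \beta_i\}$ and a residual scaling factor $\{\alpha_i\}$. A minimal sketch of that wrapping for a single token vector is given below. It assumes the common DiT-style form $x + \alpha \cdot f(\mathrm{LN}(x) \cdot (1 + \gamma) + \beta)$, which the caption does not confirm. The function names are hypothetical, and in a real model $\gamma$, $\beta$, and $\alpha$ would be produced from the coarse-grained conditions (e.g., timbre, timestep) rather than passed in by hand.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_residual(x, sublayer, gamma, beta, alpha):
    """Apply x + alpha * sublayer(LN(x) * (1 + gamma) + beta) element-wise.

    The (1 + gamma) convention is an assumption borrowed from DiT-style blocks.
    `sublayer` stands in for self-attention, cross-attention, or the MLP.
    """
    h = layer_norm(x)
    h = [hi * (1.0 + g) + b for hi, g, b in zip(h, gamma, beta)]
    out = sublayer(h)
    return [xi + a * oi for xi, a, oi in zip(x, alpha, out)]

identity = lambda h: h  # placeholder sublayer for illustration only
x = [1.0, 2.0, 3.0]
zeros = [0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0]

# With alpha = 0 the block reduces to the identity mapping.
out_closed = adaln_residual(x, identity, zeros, zeros, zeros)
# With alpha = 1 and no modulation, the normalized input is added to x.
out_open = adaln_residual(x, identity, zeros, zeros, ones)
```

With $\alpha = 0$ the sublayer contributes nothing and the block passes its input through unchanged, which is why DiT-style models often initialize the residual scale at zero; whether DiTSinger does so is not stated here.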