LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning

Sandipan Dhar, Mayank Gupta, Preeti Rao

TL;DR

This work tackles singing voice synthesis in low-resource languages by integrating language-aware embeddings and prosody-guided learning into a diffusion-based SVS framework. LAPS-Diff fuses Hindi linguistic content (IndicBERT, XPhoneBERT) with music-score context and adds style and pitch supervision via a dedicated encoder and a JDCNet-based pitch loss, while leveraging musical and linguistic priors (MERT, IndicWav2Vec) during denoising. The approach yields significant improvements over DiffSinger on a new Bollywood Hindi dataset, demonstrated through objective metrics and MOS studies, and is shown to better preserve pitch dynamics and expressive nuance. The results highlight the value of combining linguistic embeddings, prosody-aware losses, and prior embeddings for high-quality SVS in low-resource settings, with future work aiming for computational efficiency and multilingual expansion.

Abstract

The field of Singing Voice Synthesis (SVS) has seen significant advancements in recent years due to the rapid progress of diffusion-based approaches. However, capturing vocal style, genre-specific pitch inflections, and language-dependent characteristics remains challenging, particularly in low-resource scenarios. To address this, we propose LAPS-Diff, a diffusion model integrated with language-aware embeddings and a vocal-style guided learning mechanism, specifically designed for the Bollywood Hindi singing style. We curate a Hindi SVS dataset and leverage pre-trained language models to extract word- and phone-level embeddings for an enriched lyrics representation. Additionally, we incorporate a style encoder and a pitch extraction model to compute style and pitch losses, capturing features essential to the naturalness and expressiveness of the synthesized singing, particularly in terms of vocal style and pitch variations. Furthermore, we use MERT and IndicWav2Vec models to extract musical and contextual embeddings, which serve as conditional priors to further refine acoustic feature generation. Based on objective and subjective evaluations, we demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model on our constrained dataset, which is typical of the low-resource scenario.
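
As a rough illustration of the style- and pitch-guided learning described in the abstract, the sketch below adds weighted style and pitch terms to a base diffusion loss. Everything specific here is an assumption made for illustration and is not taken from the paper: the L1 distance, the function name `combined_loss`, the weights `lambda_style` and `lambda_pitch`, and the use of plain arrays in place of the outputs of a style encoder and a pitch extraction model.

```python
import numpy as np


def l1(a, b):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))


def combined_loss(noise_pred, noise_true,
                  style_pred, style_ref,
                  f0_pred, f0_ref,
                  lambda_style=1.0, lambda_pitch=1.0):
    """Hypothetical training objective: diffusion term plus auxiliary terms.

    style_* stand in for style-encoder embeddings and f0_* for pitch
    features of the generated and reference audio (assumed inputs).
    The weights are illustrative, not values reported by the authors.
    """
    diffusion_term = l1(noise_pred, noise_true)
    style_term = l1(style_pred, style_ref)
    pitch_term = l1(f0_pred, f0_ref)
    return diffusion_term + lambda_style * style_term + lambda_pitch * pitch_term
```

The auxiliary terms only add to the base objective, so setting both weights to zero recovers the plain diffusion loss. That makes it straightforward to ablate each term separately.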

Paper Structure

This paper contains 18 sections, 5 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Waveform, pitch track and music score for a song segment from our dataset.
  • Figure 2: Schematic overview of the proposed LAPS-Diff model, including training and inference stages. The components enclosed within the red-dashed boxes represent the specific enhancements of this work over the DiffSinger framework.
  • Figure 3: PCA visualization of content embeddings extracted using IndicWav2Vec from ground truth, DiffSinger, and LAPS-Diff outputs, illustrating content-level similarity. Here, each color represents a unique audio segment.
  • Figure 4: Visualization of the F0 contour comparing ground truth with synthesized outputs from DiffSinger and LAPS-Diff, all with reference to the MIDI score. The vertical axis shows frequency (Hz), and the horizontal axis represents time (seconds). The top row contains a sample with a faster singing rate, and the bottom row shows a sample with a slower singing rate.
  • Figure 5: Mel spectrograms comparing ground truth with synthesized outputs from DiffSinger and LAPS-Diff. The top row contains a sample with a faster singing rate, whereas the bottom row features a sample with a slower singing rate.