Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

Nithya Shikarpur, Krishna Maneesha Dendukuri, Yusong Wu, Antoine Caillon, Cheng-Zhi Anna Huang

TL;DR

This work addresses the challenge of generating Hindustani vocal melodies by introducing GaMaDHaNi, a two-level hierarchical model that uses finely quantized pitch contours as an intermediate representation to drive audio synthesis. The Pitch Generator (autoregressive or diffusion) produces a detailed pitch contour, which the Spectrogram Generator (diffusion-based and conditioned on singer and pitch) converts into mel-spectrograms for audio via a vocoder. With a 120-hour dataset, the method demonstrates competitive melodic quality in listening tests and shows meaningful pitch adherence (mean $r=0.71$) despite data limitations, enabling interactive use cases such as primed generation and coarse pitch conditioning. The approach highlights a path toward human-AI collaboration in Hindustani music by providing interpretable intermediate representations and controllable synthesis, while acknowledging limitations related to tonal, rhythmic (tala), and ornamentation (gamak) aspects and suggesting future work on vocoders and richer conditioning.

Abstract

Hindustani music is a performance-driven oral tradition that exhibits the rendition of rich melodic patterns. In this paper, we focus on generative modeling of singers' vocal melodies extracted from audio recordings, as the voice is musically prominent within the tradition. Prior generative work in Hindustani music models melodies as coarse discrete symbols, which fails to capture the rich expressive melodic intricacies of singing. Thus, we propose to use a finely quantized pitch contour as an intermediate representation for hierarchical audio modeling. We propose GaMaDHaNi, a modular two-level hierarchy consisting of a generative model on pitch contours and a pitch-contour-to-audio synthesis model. We compare our approach to non-hierarchical audio models and to hierarchical models that use a self-supervised intermediate representation, through a listening test and qualitative analysis. We also evaluate the audio model's ability to faithfully represent the pitch contour input using the Pearson correlation coefficient. By using pitch contours as an intermediate representation, we show that our model may be better equipped to listen and respond to musicians in a human-AI collaborative setting by highlighting two potential interaction use cases: (1) primed generation and (2) coarse pitch conditioning.
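The pitch-adherence evaluation mentioned above relies on the Pearson correlation coefficient between the pitch contour given to the audio model and the pitch contour extracted from the generated audio. The following is a minimal sketch of that computation, not the authors' evaluation code. The function name `pitch_adherence`, the use of cents as the pitch unit, and the convention of marking unvoiced frames with NaN are all assumptions made for illustration.

```python
import math


def pitch_adherence(conditioning, extracted):
    """Pearson r between two frame-aligned pitch contours.

    Both inputs are sequences of floats of equal frame rate (assumed to be
    in cents here). Frames where either contour is NaN (an assumed marker
    for unvoiced frames) are skipped.
    """
    pairs = [
        (a, b)
        for a, b in zip(conditioning, extracted)
        if not (math.isnan(a) or math.isnan(b))
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two frames voiced in both contours")

    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n

    # Sums of centered products and squares; the 1/n factors cancel in r.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant contour")

    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    nan = float("nan")
    # Hypothetical contours: the extracted one is offset by 10 cents and
    # each contour has one unvoiced frame.
    cond = [0.0, 100.0, 200.0, 300.0, nan, 500.0]
    extr = [10.0, 110.0, 210.0, 310.0, 400.0, nan]
    print(round(pitch_adherence(cond, extr), 3))
```

Because Pearson correlation is invariant to constant offsets and positive scaling, a generated contour that is uniformly sharp or flat still scores highly; the measure captures whether the shape of the melody is followed rather than its absolute tuning.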

Paper Structure (23 sections, 6 equations, 6 figures)

Figures (6)

  • Figure 1: Extracted pitch from Hindustani vocal audio highlighting the melodic intricacies involved. Solfege notation is highlighted as a horizontal grid.
  • Figure 2: The overall hierarchical generation structure of GaMaDHaNi, comprising the Pitch Generator, the Spectrogram Generator and a vocoder. During inference, given an optional short melodic input, i.e. a 'prime', the Pitch Generator produces a pitch continuation and the Spectrogram Generator produces a spectrogram conditioned on the resulting pitch.
  • Figure 3: Results from the listening study, showing how many times each system was preferred.
  • Figure 4: Examples of ground truth pitch (blue) and extracted pitch contour from the generated sample (orange) to highlight pitch adherence with low and high correlation, $r$ (top to bottom). Low correlation: Audio 0 ($r=0.1$) and 1 ($r=0.11$) are examples of errors in pitch detection. High correlation: Audio 2 ($r=0.94$) and 3 ($r=0.99$).
  • Figure 5: A staircase descending scale (in blue) as a coarse input. This input is then processed as described in the paper's section on coarse pitch conditioning and fed into the model. The generated fine-grained contour (in orange) has glides (mindh) and fast jerky movement (gamak) characteristic of Hindustani music.
  • ...and 1 more figures
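
To make the coarse input in Figure 5 concrete, the sketch below builds a staircase descending scale as a frame-wise pitch contour. It only illustrates the shape of such an input; it does not reproduce the paper's preprocessing before the contour is fed to the model. The scale degrees (a major scale in cents above the tonic), the number of frames per note, and the function name `staircase_descending_scale` are all hypothetical choices.

```python
def staircase_descending_scale(frames_per_note=50, degrees_cents=None):
    """Return a piecewise-constant descending contour in cents above the tonic.

    degrees_cents lists scale degrees in ascending order; the default is a
    major scale spanning one octave (an assumption made for illustration).
    """
    if frames_per_note < 1:
        raise ValueError("frames_per_note must be at least 1")
    if degrees_cents is None:
        degrees_cents = [0, 200, 400, 500, 700, 900, 1100, 1200]

    contour = []
    # Walk the scale from the top note down, holding each note flat.
    for cents in sorted(degrees_cents, reverse=True):
        contour.extend([float(cents)] * frames_per_note)
    return contour


if __name__ == "__main__":
    coarse = staircase_descending_scale(frames_per_note=3)
    print(len(coarse), coarse[0], coarse[-1])
```

Each flat step carries no glides or oscillations, which is what makes the input coarse: the expressive detail seen in the orange contour of Figure 5 is left for the model to generate.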