ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning

Daewoong Kim, Hao-Wen Dong, Dasaem Jeong

TL;DR

ViolinDiff addresses the challenge of expressive polyphonic violin synthesis by explicitly modeling pitch bend as a conditioning signal within a two-stage diffusion framework. A Bend Estimation Module estimates polyphonic pitch bend from MIDI, and a Synthesis Module generates mel spectrograms conditioned on piano-roll MIDI and the estimated bend, with audio rendered by SoundStream. Quantitative metrics (FAD, vibrato) and a MUSHRA listening test show that explicit bend modeling yields more realistic and expressive violin sounds than a NoBend baseline, with improved vibrato handling and cross-piece generalization. The approach introduces a bend encoding scheme, a vibrato evaluation method, and a practical pathway to high-quality solo violin synthesis with publicly available data and demos.

Abstract

Modeling the natural contour of fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates a mel spectrogram incorporating these expressive details. Quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than a model without explicit pitch bend modeling. Audio samples are available online: daewoung.github.io/ViolinDiff-Demo.
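
To make the idea of representing an F0 contour as pitch bend concrete, here is a minimal sketch that converts a MIDI note number plus a bend offset into a frequency using standard equal-temperament tuning (A4 = MIDI note 69 = 440 Hz). The function names, the choice of semitones as the bend unit, and the sinusoidal vibrato parameters are illustrative assumptions; the paper's own bend encoding scheme is not specified in this summary.

```python
import math

A4_MIDI = 69    # MIDI note number of A4
A4_HZ = 440.0   # standard concert pitch

def bend_to_f0(midi_note: int, bend_semitones: float) -> float:
    """Frequency in Hz of a MIDI note shifted by a bend given in semitones."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI + bend_semitones) / 12.0)

def vibrato_contour(midi_note, rate_hz=6.0, depth_semitones=0.3,
                    frame_rate=100, n_frames=50):
    """Hypothetical sinusoidal vibrato: one F0 value per frame for one note."""
    contour = []
    for i in range(n_frames):
        # Bend oscillates around zero, so the F0 oscillates around the note.
        bend = depth_semitones * math.sin(2 * math.pi * rate_hz * i / frame_rate)
        contour.append(bend_to_f0(midi_note, bend))
    return contour

contour = vibrato_contour(69)
```

With a zero bend the note sits at its nominal frequency; a polyphonic representation would keep one such bend value per active pitch and frame rather than a single monophonic contour.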

Paper Structure (15 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the architectures of (a) the Bend Estimation Module and (b) the Synthesis Module. $R_{\text{bend}_t}$ and $M_t$ denote $R_{\text{bend}}$ and $M$ with noise applied at step $t$ of the diffusion process. The Denoiser in both modules adopts the non-causal WaveNet architecture used in DiffSinger, modified to incorporate performer conditioning through FiLM layers: each residual block is conditioned on performer embeddings.
  • Figure 2: Spectrogram comparison with F0 (green curves). From left to right: (a) original audio, (b) ViolinDiff, and (c) the baseline NoBend model. Spectrograms are cropped to exclude frequencies below the violin's lowest note.
  • Figure 3: The box plot illustrates the results of the MUSHRA realism listening test. The p-values between models were calculated using the Wilcoxon signed-rank test. The green lines connect the compared models.
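
The FiLM conditioning mentioned in the Figure 1 caption can be illustrated with a rough sketch. FiLM (feature-wise linear modulation) predicts a per-channel scale and shift from a conditioning vector and applies them to a feature map. The code below is plain Python with toy, hand-picked weights; the function names, dimensions, and values are all hypothetical, and this is not the authors' implementation.

```python
def linear(weights, bias, vec):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(features, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each channel of `features` using the embedding.

    features:  list of channels, each a list of per-frame values.
    embedding: conditioning vector (for example, a performer embedding).
    """
    gamma = linear(w_gamma, b_gamma, embedding)  # per-channel scale
    beta = linear(w_beta, b_beta, embedding)     # per-channel shift
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]

# Toy example: 2 channels, 3 frames, 2-dimensional embedding.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
embedding = [1.0, 0.0]
w_gamma, b_gamma = [[2.0, 0.0], [0.0, 1.0]], [0.0, 1.0]
w_beta, b_beta = [[0.0, 0.0], [1.0, 0.0]], [0.5, 0.0]

out = film(features, embedding, w_gamma, b_gamma, w_beta, b_beta)
```

In this toy case the embedding yields scales of 2.0 and 1.0 and shifts of 0.5 and 1.0, so the same feature map is modulated differently for each conditioning vector. In a residual denoiser, a modulation of this kind would be applied inside every block.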