ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning
Daewoong Kim, Hao-Wen Dong, Dasaem Jeong
TL;DR
ViolinDiff addresses the challenge of expressive polyphonic violin synthesis by explicitly modeling pitch bend as a conditioning signal within a two-stage diffusion framework. A Bend Estimation Module estimates polyphonic pitch bend from MIDI, and a Synthesis Module generates mel spectrograms conditioned on the piano-roll MIDI and the estimated bend, with audio rendered by SoundStream. Quantitative metrics (FAD, vibrato) and listening tests (MUSHRA) show that explicit bend modeling yields more realistic and expressive violin sounds than a NoBend baseline, with improved vibrato handling and cross-piece generalization. The approach introduces a bend encoding scheme, a vibrato evaluation method, and a practical pathway to high-quality solo violin synthesis with publicly available data and demos.
Abstract
Modeling the natural contour of fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates a mel spectrogram that incorporates these expressive details. Quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than a model without explicit pitch bend modeling. Audio samples are available online: daewoung.github.io/ViolinDiff-Demo.
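
To make the relation between an F0 contour and pitch bend concrete, the sketch below expresses an F0 value as a deviation, in semitones, from the nominal pitch of a MIDI note under twelve-tone equal temperament with A4 = 440 Hz. The function names are hypothetical and chosen for illustration only; this is the standard frequency-to-semitone conversion, not the bend encoding scheme used by ViolinDiff.

```python
import math

A4_HZ = 440.0   # reference tuning (assumption: standard concert pitch)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: int) -> float:
    """Nominal frequency of a MIDI note in equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def bend_in_semitones(f0_hz: float, note: int) -> float:
    """Deviation of an observed F0 from the note's nominal pitch, in semitones."""
    return 12.0 * math.log2(f0_hz / midi_to_hz(note))


def apply_bend(note: int, bend: float) -> float:
    """Inverse mapping: F0 in Hz for a note shifted by `bend` semitones."""
    return midi_to_hz(note) * 2.0 ** (bend / 12.0)
```

For example, `bend_in_semitones(440.0, 69)` is zero because the observed F0 matches the nominal pitch of A4, while a vibrato that oscillates around the note produces a bend value that oscillates around zero. In polyphonic passages, one such contour would be needed per sounding note.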
