Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting
Hounsu Kim, Soonbeom Choi, Juhan Nam
TL;DR
This work tackles expressive acoustic guitar synthesis by introducing guitarroll, a guitar-specific input representation, and a diffusion-based outpainting framework that produces audio with long-term coherence. The system combines a T5-like architecture with mel-spectrogram denoising and a SoundStream vocoder, and replaces sparse inputs with the dense guitarroll encoding. A large pretraining dataset (Lakh-Ilya) is used alongside GuitarSet to overcome data scarcity, and a RePaint-inspired outpainting strategy enables efficient continuation of long sequences. Objective metrics and listening tests show improved timbre realism and competitive overall quality compared to baselines and prior work, highlighting the approach's potential for realistic, expressive guitar synthesis.
Abstract
Synthesizing the sound of a guitar performance is a highly challenging task due to the instrument's polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with an input representation customized to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting, which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we not only used an existing guitar dataset but also collected data from a high-quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model achieves higher audio quality than the baseline model and generates more realistic timbre than the previous leading work.
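To make the idea of diffusion-based outpainting more concrete, the following is a minimal sketch under stated assumptions, not the model described in this work. Each "frame" is a single float rather than a mel-spectrogram frame, the noise schedule is a simple linear one, and `toy_denoise_step` is a hypothetical stand-in for a trained denoiser. The function names and parameters (`outpaint_segment`, `generate_long`, `seg_len`, `overlap`) are illustrative only. What the sketch shows is the RePaint-style mechanism: at every reverse step the overlapping frames are overwritten with a re-noised copy of the already generated audio, so each new segment continues the previous one.

```python
import math
import random


def toy_denoise_step(x, t):
    """Hypothetical stand-in for a trained denoiser.

    Shrinks every frame toward 0, reaching exactly 0 at t = 1.
    """
    return [v * (t - 1) / t for v in x]


def outpaint_segment(known_prefix, seg_len, num_steps, rng,
                     denoise_step=toy_denoise_step):
    """Generate one segment whose first frames are fixed to known_prefix."""
    overlap = len(known_prefix)
    if overlap >= seg_len:
        raise ValueError("seg_len must be larger than the known prefix")
    # Start the whole segment from Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(seg_len)]
    for t in range(num_steps, 0, -1):
        # Reverse step: propose the less noisy state for all frames.
        x = denoise_step(x, t)
        # Overwrite the known region with the prefix, noised to the
        # level of step t - 1 (assumed linear schedule, 0 at the end).
        level = (t - 1) / num_steps
        for i in range(overlap):
            x[i] = (math.sqrt(1.0 - level) * known_prefix[i]
                    + math.sqrt(level) * rng.gauss(0.0, 1.0))
    return x


def generate_long(total_len, seg_len, overlap, num_steps, seed=0):
    """Chain segments; each one is conditioned on the tail of the output."""
    if not 1 <= overlap < seg_len:
        raise ValueError("need 1 <= overlap < seg_len")
    rng = random.Random(seed)
    out = outpaint_segment([], seg_len, num_steps, rng)
    while len(out) < total_len:
        seg = outpaint_segment(out[-overlap:], seg_len, num_steps, rng)
        out.extend(seg[overlap:])  # keep only the newly generated frames
    return out[:total_len]
```

In this sketch every segment after the first contributes `seg_len - overlap` new frames, so a longer overlap gives more context for continuity at the cost of more segments per output. With the toy denoiser the generated frames are trivially zero; a trained model would take its place in practice.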
