Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting

Hounsu Kim; Soonbeom Choi; Juhan Nam

TL;DR

This work tackles expressive acoustic guitar synthesis by introducing guitarroll, a guitar-specific input representation, and a diffusion-based outpainting framework that produces long-term coherent audio. The system combines a T5-like architecture with mel-spectrogram denoising and a SoundStream vocoder, and replaces sparse inputs with the dense guitarroll encoding. A large pretraining dataset (Lakh-Ilya) is used alongside GuitarSet to overcome data scarcity, and a RePaint-inspired outpainting strategy enables efficient continuation of long sequences. Objective metrics and listening tests show improved timbre realism and competitive overall quality compared to baselines and prior work, highlighting the approach's potential for realistic, expressive guitar synthesis.
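The mel-spectrogram denoiser mentioned above is, according to the Figure 2 caption, wrapped in the scalings $c_{in}, c_{out}, c_{noise}, c_{skip}$ of Karras et al. (2022). The sketch below writes out those scalings as they are defined in that reference, using the caption's values $\sigma_{data}=0.1$, $\rho=9$, $P_{mean}=-3.0$, $P_{std}=1.0$. The function names, the `raw_network` callable, and the `sigma_min`/`sigma_max` arguments are hypothetical and not taken from the paper; this is an illustration, not the authors' code.

```python
import math
import random

SIGMA_DATA = 0.1  # value stated in the Figure 2 caption


def edm_scalings(sigma, sigma_data=SIGMA_DATA):
    """Return (c_in, c_out, c_skip, c_noise) as defined by Karras et al. (2022)."""
    var_sum = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / var_sum
    c_out = sigma * sigma_data / math.sqrt(var_sum)
    c_in = 1.0 / math.sqrt(var_sum)
    c_noise = math.log(sigma) / 4.0
    return c_in, c_out, c_skip, c_noise


def precondition(raw_network, x_noisy, sigma, cond):
    """Denoised estimate D(x; sigma) built from a raw network F (hypothetical callable)."""
    c_in, c_out, c_skip, c_noise = edm_scalings(sigma)
    return c_skip * x_noisy + c_out * raw_network(c_in * x_noisy, c_noise, cond)


def sample_training_sigma(rng, p_mean=-3.0, p_std=1.0):
    """Draw a training noise level with ln(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))


def karras_sigmas(n, sigma_min, sigma_max, rho=9.0):
    """Sampling-time noise levels, decreasing from sigma_max to sigma_min (n >= 2).

    sigma_min and sigma_max are not given in the source and must be supplied.
    """
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]
```

With $\sigma = \sigma_{data}$ the skip weight is exactly 0.5, so the output is an even blend of the noisy input and the network prediction; for much larger $\sigma$ the skip weight falls toward zero and the network output dominates.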

Abstract

Synthesizing the sound of a performed guitar is highly challenging because of the instrument's polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with an input representation customized to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting, which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we use not only an existing guitar dataset but also data collected from a high-quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model achieves higher audio quality than the baseline model and generates more realistic timbre than the previous leading work.
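To make the guitarroll idea concrete, here is a minimal sketch of how a dense, frame-level encoding could be built from note events. It uses only the values given in the Figure 1 caption (pitches from 40 to 84, onset value 85). Everything else is an assumption for illustration: one row per string, a silence value of 0, the frame rate, the note tuple layout, and the function name are hypothetical and may differ from the authors' representation.

```python
import numpy as np

NUM_STRINGS = 6                # assumption: one row per guitar string
MIN_PITCH, MAX_PITCH = 40, 84  # pitch range from the Figure 1 caption
ONSET_VALUE = 85               # onset marker from the Figure 1 caption
SILENCE = 0                    # assumption: value where no note sounds


def notes_to_guitarroll(notes, num_frames, frames_per_second=100):
    """Build a (strings, frames) integer array from note events.

    notes: iterable of (string_index, midi_pitch, onset_sec, offset_sec).
    """
    roll = np.full((NUM_STRINGS, num_frames), SILENCE, dtype=np.int64)
    for string, pitch, onset, offset in sorted(notes, key=lambda n: n[2]):
        if not 0 <= string < NUM_STRINGS:
            raise ValueError(f"string index out of range: {string}")
        if not MIN_PITCH <= pitch <= MAX_PITCH:
            raise ValueError(f"pitch out of range: {pitch}")
        start = max(int(round(onset * frames_per_second)), 0)
        end = int(round(offset * frames_per_second))
        end = max(end, start + 2)  # keep one pitch frame after the onset
        if start >= num_frames:
            continue
        end = min(end, num_frames)
        roll[string, start:end] = pitch   # sustain frames carry the pitch
        roll[string, start] = ONSET_VALUE  # first frame marks the onset
    return roll
```

For example, `notes_to_guitarroll([(0, 40, 0.0, 0.05)], num_frames=8)` fills the first row with one onset frame followed by four frames of pitch 40, and leaves all other rows silent. Unlike a sparse list of note events, every frame of every row carries a value, which is the sense in which the encoding is dense.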

Paper Structure (16 sections, 4 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Method
Overall Architecture
The Guitarroll Representation for Guitars
Diffusion-based Generative Modeling
Diffusion Outpainting Method For Continuation
Experiments
Datasets
Experiment Settings
Baselines
Objective Metrics
Listening Test
Results
Quantitative Results
Qualitative Results
...and 1 more section

Figures (3)

Figure 1: A MIDI plot of GuitarSet (Xi et al., 2018) (top) and the corresponding guitarroll representation (bottom). The note pitch ranges from 40 (E2, dark purple) to 84 (C6, bright yellow). The value marking the onset position (85) is visible as a yellow vertical line at each note's starting point.
Figure 2: Overall model architecture. $c_{in}, c_{out}, c_{noise}, c_{skip}$ are the scaling schedules, which depend on $\sigma_{data}=0.1$. The remaining hyperparameters are $\rho=9$, $P_{mean}=-3.0$, and $P_{std}=1.0$. Each variable is defined in Karras et al. (2022).
Figure 3: The proposed outpainting algorithm for the acoustic guitar synthesis task. To generate coherent audio from a long MIDI note sequence, from the $2^{\text{nd}}$ stage onward the model fills in the remainder given the previously generated data.
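The staged continuation described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it pins the overlapping frames by re-noising the previously generated data at each step (a RePaint-style constraint), and it uses a plain Euler sampler in place of whatever sampler the paper uses. The names `denoise`, `seg_len`, `overlap`, and `n_bins` are hypothetical.

```python
import numpy as np


def sample_segment(denoise, cond, shape, sigmas, rng, known=None):
    """Euler sampling of one segment; `known` pins the first frames.

    denoise(x, sigma, cond) must return an estimate of the clean segment.
    sigmas is a decreasing sequence of noise levels ending in 0.0.
    """
    x = sigmas[0] * rng.standard_normal(shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        if known is not None:
            # Overwrite the known region with a re-noised copy of prior output.
            k = known.shape[0]
            x[:k] = known + sigma * rng.standard_normal(known.shape)
        x0 = denoise(x, sigma, cond)
        x = x + (sigma_next - sigma) * (x - x0) / sigma
    if known is not None:
        x[:known.shape[0]] = known  # keep the earlier frames unchanged
    return x


def outpaint(denoise, conds, seg_len, overlap, n_bins, sigmas, seed=0):
    """Generate (total_frames, n_bins) output stage by stage."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    rng = np.random.default_rng(seed)
    total = conds.shape[0]
    first = min(seg_len, total)
    # Stage 1: no previous data, sample the first segment freely.
    out = sample_segment(denoise, conds[:first], (first, n_bins), sigmas, rng)
    # Stage 2 onward: condition on the tail of what was already generated.
    while out.shape[0] < total:
        start = out.shape[0] - overlap
        stop = min(start + seg_len, total)
        seg = sample_segment(denoise, conds[start:stop], (stop - start, n_bins),
                             sigmas, rng, known=out[start:])
        out = np.concatenate([out, seg[overlap:]], axis=0)
    return out
```

Each stage after the first advances by `seg_len - overlap` frames, so a larger overlap gives the model more context from earlier output at the cost of more sampling passes over the same sequence.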
