Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting

Hounsu Kim, Soonbeom Choi, Juhan Nam

TL;DR

This work tackles expressive acoustic guitar synthesis by introducing guitarroll, a guitar-specific input representation, and a diffusion-based outpainting framework that produces audio with long-term coherence. The system combines a T5-like architecture with mel-spectrogram denoising and a SoundStream vocoder, and replaces sparse inputs with the dense guitarroll encoding. A large pretraining dataset (Lakh-Ilya) is used alongside GuitarSet to overcome data scarcity, and a RePaint-inspired outpainting strategy enables efficient continuation of long sequences. Objective metrics and listening tests show improved timbre realism and competitive overall quality compared to baselines and prior work, highlighting the approach's potential for realistic, expressive guitar synthesis.

Abstract

Synthesizing performed guitar sound is highly challenging because of the instrument's polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with an input representation customized to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting, which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we use not only an existing guitar dataset but also data collected from a high-quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model achieves higher audio quality than the baseline model and generates more realistic timbre than the previous leading work.

Paper Structure (16 sections, 4 equations, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: A MIDI plot from GuitarSet (Xi et al., 2018) (upper) and the corresponding guitarroll representation (bottom). The note pitch ranges from 40 (E2, dark purple) to 84 (C6, bright yellow). The value assigned to onset positions (85) appears as a yellow vertical line at each note's starting point.
  • Figure 2: Overall model architecture. $c_{in}, c_{out}, c_{noise}, c_{skip}$ are the scaling schedules, which depend on $\sigma_{data}=0.1$. The other settings are $\rho=9$, $P_{mean}=-3.0$, and $P_{std}=1.0$. Each variable is defined in Karras et al. (2022).
  • Figure 3: The proposed outpainting algorithm for the acoustic guitar synthesis task. To generate coherent audio from a long MIDI note sequence, from the $2^{\text{nd}}$ stage onward the model fills in the remainder given the previously generated data.
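
The scaling schedules named in the Figure 2 caption can be written out explicitly. The sketch below follows the definitions in Karras et al. (2022) and plugs in the caption's values ($\sigma_{data}=0.1$, $\rho=9$, $P_{mean}=-3.0$, $P_{std}=1.0$). The function names, the step count, and the `sigma_min`/`sigma_max` arguments are illustrative assumptions that do not come from the paper, and this is not the authors' implementation.

```python
import math
import random

# Values quoted in the Figure 2 caption.
SIGMA_DATA = 0.1
RHO = 9
P_MEAN, P_STD = -3.0, 1.0


def edm_scalings(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for one noise level sigma,
    using the preconditioning formulas of Karras et al. (2022)."""
    var_sum = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / var_sum
    c_out = sigma * sigma_data / math.sqrt(var_sum)
    c_in = 1.0 / math.sqrt(var_sum)
    c_noise = math.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise


def sigma_schedule(num_steps, sigma_min, sigma_max, rho=RHO):
    """Sampling noise levels from sigma_max down to sigma_min, spaced with
    the rho-warped rule. sigma_min and sigma_max are hypothetical inputs:
    the caption does not state them. Requires num_steps >= 2."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    return [(hi + i / (num_steps - 1) * (lo - hi)) ** rho
            for i in range(num_steps)]


def sample_training_sigma(rng, p_mean=P_MEAN, p_std=P_STD):
    """Draw a training noise level with ln(sigma) ~ Normal(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))


if __name__ == "__main__":
    rng = random.Random(0)
    sigma = sample_training_sigma(rng)
    print("sigma:", sigma)
    print("scalings:", edm_scalings(sigma))
    # Hypothetical range, chosen only to make the example run.
    print("schedule:", sigma_schedule(5, sigma_min=0.002, sigma_max=80.0))
```

With these scalings the denoiser output is $c_{skip}\,x + c_{out}\,F(c_{in}\,x;\, c_{noise})$, where $F$ is the network, so at $\sigma=\sigma_{data}$ the skip path and the network path are weighted equally in variance terms ($c_{skip}=0.5$).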