Towards Generalizability to Tone and Content Variations in the Transcription of Amplifier Rendered Electric Guitar Audio

Yu-Hua Chen, Yuan-Chiao Cheng, Yen-Tung Yeh, Jui-Te Wu, Jyh-Shing Roger Jang, Yi-Hsuan Yang

TL;DR

This work tackles automatic transcription of amplifier-rendered electric guitar audio, a task hampered by limited data and wide variation in tone. It introduces EGDB-PG, a dataset covering 256 tone presets (16 amplifier heads × 16 cabinets), and the Tone-informed Transformer (TIT), which conditions transcription on a tone embedding $c=f_{\text{tone}}(\mathbf{x}_{r,\theta})$ to produce a score $\hat{\mathbf{s}} = h_{\text{trans}}(\mathbf{x}_{r,\theta}, c)$. Ablations and baseline comparisons show that tone embeddings, content augmentation, and audio normalization each improve transcription accuracy on both in-domain and out-of-domain amplifier tones, with TIT outperforming existing architectures. Out-of-domain experiments with Neural DSP tones indicate that a large, diverse tone prior combined with tone-aware conditioning improves generalization. The result is a practical framework for tone-aware electric guitar transcription and a starting point for transcription research on other effect-rendered audio.
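
The two-stage formulation above, $c=f_{\text{tone}}(\mathbf{x}_{r,\theta})$ followed by $\hat{\mathbf{s}} = h_{\text{trans}}(\mathbf{x}_{r,\theta}, c)$, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the preset names, the embedding size, the output shape, and the bodies of `f_tone` and `h_trans` are toy placeholders, not the paper's tone encoder or the TIT network. It only shows the data flow, in which the tone embedding is computed from the rendered audio and then passed to the transcriber alongside that same audio.

```python
import itertools
import numpy as np

# Hypothetical preset grid: 16 amplifier heads x 16 cabinets = 256 tones.
# The names are placeholders, not the presets used in EGDB-PG.
AMP_HEADS = [f"amp_{i:02d}" for i in range(16)]
CABINETS = [f"cab_{j:02d}" for j in range(16)]
PRESETS = list(itertools.product(AMP_HEADS, CABINETS))


def f_tone(x, dim=8):
    """Toy stand-in for the tone encoder: c = f_tone(x).

    Summarizes the log-magnitude spectrum into `dim` band averages.
    A real tone encoder would be a learned model.
    """
    spectrum = np.log1p(np.abs(np.fft.rfft(x)))
    bands = np.array_split(spectrum, dim)
    return np.array([band.mean() for band in bands])


def h_trans(x, c, n_frames=10, n_pitches=49):
    """Toy stand-in for the transcriber: s_hat = h_trans(x, c).

    Returns an (n_frames, n_pitches) matrix of values in (0, 1).
    The tone embedding c shifts the logits, which mimics conditioning
    but is not how TIT integrates it (TIT uses cross-attention).
    """
    frames = np.array_split(x, n_frames)
    energy = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    bias = np.resize(c, n_pitches)  # repeat c cyclically along the pitch axis
    logits = energy[:, None] + bias[None, :]
    return 1.0 / (1.0 + np.exp(-logits))


# One second of synthetic "rendered" audio at an assumed 16 kHz sample rate.
x = np.random.default_rng(0).standard_normal(16000)
c = f_tone(x)          # tone embedding from the rendered audio
s_hat = h_trans(x, c)  # transcription conditioned on the tone embedding
print(len(PRESETS), c.shape, s_hat.shape)
```

Keeping the tone estimate as a separate input is what makes the conditioning explicit: the same `h_trans` can be called with embeddings from different presets without changing its interface.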

Abstract

Transcribing electric guitar recordings is challenging due to the scarcity of diverse datasets and the complex tone-related variations introduced by amplifiers, cabinets, and effect pedals. To address these issues, we introduce EGDB-PG, a novel dataset designed to capture a wide range of tone-related characteristics across various amplifier-cabinet configurations. In addition, we propose the Tone-informed Transformer (TIT), a Transformer-based transcription model enhanced with a tone embedding mechanism that leverages learned representations to improve the model's adaptability to tone-related nuances. Experiments demonstrate that TIT, trained on EGDB-PG, outperforms existing baselines across diverse amplifier types, with transcription accuracy improvements driven by the dataset's diversity and the tone embedding technique. Through detailed benchmarking and ablation studies, we evaluate the impact of tone augmentation, content augmentation, audio normalization, and tone embedding on transcription performance. This work advances electric guitar transcription by overcoming limitations in dataset diversity and tone modeling, providing a robust foundation for future research.

Paper Structure

This paper contains 30 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Log-mel spectrograms of recordings of (a) clean audio and (b) audio rendered by a high-gain amplifier for the same musical content, demonstrating the significant alteration in frequency distribution and intensity. The high-gain processing introduces additional harmonic content and noise, illustrating the challenges faced by transcription models in generalizing across varied audio signals.
  • Figure 2: Visualization of audio representations with varying amplifier presets from the EGDB-PG dataset. (a) Piano roll displaying the note labels of the audio content. (b–e) Log-mel spectrograms of the same audio content processed with one setting from each amplifier type category defined in EGDB-PG: (b) clean (DI) audio, (c) low-gain, (d) crunch, and (e) high-gain amplifiers. These visualizations demonstrate how amplifier settings modify the spectral characteristics of the audio, including differences in harmonic and non-harmonic growth as gain increases, while preserving the underlying musical content.
  • Figure 3: Architecture of the Tone-informed Transformer (TIT). The model features a frequency encoder with two modules: the tone-informed frequency-axis Transformer encoder, which processes spectrogram inputs along the frequency axis with tone embeddings integrated via cross-attention, and the tone-informed pitch-axis Transformer decoder, which refines pitch-related features. The frequency encoder generates predictions for note onsets, offsets, and frames, which are further processed by a time-axis Transformer encoder.
  • Figure 4: Note onset F1 values on the EGDB-PG test split. The blue line represents TIT trained on EGDB-PG alone, while the green line represents TIT trained on EGDB-PG with content augmentation. The histograms display the pitch distributions of EGDB-PG and amplifier-rendered GuitarSet. The top row shows F1 values for the EGDB-PG-only model across three amplifier types (low-gain, crunch, and high-gain). The bottom row compares models trained on EGDB-PG alone with those trained on EGDB-PG plus content augmentation. The results show that content augmentation improves performance, especially for higher pitches, by leveraging the combined datasets.
  • Figure 5: Out-of-domain 10-second audio clips. Panel (a), left, displays the log-mel spectrogram of the audio. Panel (b), middle, shows the piano roll predicted by the TIT model, and panel (c), right, shows the piano roll predicted by the hFT-Maestro-EGDB-PG model, which is fine-tuned on the EGDB-PG dataset. The comparison highlights the difference in transcription performance: the hFT-Maestro-EGDB-PG model generates notes that are often unplayable on a guitar, likely due to piano-specific traits inherited from the Maestro dataset.
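
Figure 4 reports note onset F1. As a rough illustration of how such a score can be computed, the Python sketch below matches estimated notes to reference notes and derives precision, recall, and F1. The details are assumptions rather than the paper's evaluation protocol: the 50 ms onset tolerance is a common convention in transcription evaluation, pitches are compared as exact MIDI numbers, and matching is greedy, whereas standard toolkits typically use optimal bipartite matching. The function name and note format are hypothetical.

```python
def onset_f1(ref_notes, est_notes, tol=0.05):
    """Note onset precision, recall, and F1 with greedy one-to-one matching.

    Each note is an (onset_seconds, midi_pitch) tuple. An estimated note
    counts as correct if an unmatched reference note has the same pitch
    and an onset within `tol` seconds (assumed tolerance: 50 ms).
    """
    matched = set()  # indices of reference notes already matched
    true_positives = 0
    for est_onset, est_pitch in est_notes:
        for i, (ref_onset, ref_pitch) in enumerate(ref_notes):
            if i in matched:
                continue
            if ref_pitch == est_pitch and abs(ref_onset - est_onset) <= tol:
                matched.add(i)
                true_positives += 1
                break
    precision = true_positives / len(est_notes) if est_notes else 0.0
    recall = true_positives / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example notes: the second estimate is 80 ms late (a miss) and
# the fourth has no reference counterpart (a false positive).
ref = [(0.00, 60), (0.50, 64), (1.00, 67)]
est = [(0.01, 60), (0.58, 64), (1.02, 67), (1.50, 72)]
p, r, f1 = onset_f1(ref, est)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A per-pitch curve like the one in Figure 4 could be obtained by filtering both note lists to a single MIDI pitch before calling the function.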