Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

Ying-Shuo Lee, Yueh-Po Peng, Jui-Te Wu, Ming Cheng, Li Su, Yi-Hsuan Yang

TL;DR

The paper addresses distortion recovery for electric guitar by removing real-world effect processing from recordings. It introduces a two-stage framework: a Mel-spectrogram denoiser that produces a dry Mel representation, followed by a neural vocoder (HiFi-GAN) that reconstructs the dry waveform. Evaluations on VST-derived data and synthetic baselines show superior objective metrics (e.g., lower FAD and higher SI-SDR) and strong subjective quality (MOS around 4), especially when trained on realistic VST data. The approach demonstrates improved fidelity and practical potential for downstream tasks such as transcription and mixing, and it emphasizes the importance of realistic training data for distortion removal.

Abstract

Removing audio effects from electric guitar recordings simplifies post-production and sound editing. An audio distortion recovery model not only improves the clarity of the guitar sound but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to capture the complexities of real-world recordings. In this paper, we tackle the task using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery: the first stage processes the audio signal in the Mel-spectrogram domain, and the second stage uses a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram. We report a set of experiments demonstrating the effectiveness of our approach over existing methods, using both subjective and objective evaluation metrics.
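One of the objective metrics named in the TL;DR is SI-SDR (scale-invariant signal-to-distortion ratio). The following is a minimal sketch of how that metric is commonly computed, not the authors' evaluation code. The function name `si_sdr`, the `eps` constant, the mean removal step, and the toy `tanh` "wet" signal are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample lists.

    Illustrative sketch: mean removal and `eps` are common conventions,
    not details confirmed by the paper.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the DC offset from both signals.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [alpha * b for b in r]

    # Everything in the estimate that is not the scaled target is distortion.
    noise = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Toy signals (hypothetical): a dry sine tone and a tanh-clipped "wet" copy.
sample_rate = 8000
dry = [0.5 * math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
wet = [math.tanh(8.0 * x) for x in dry]

print("SI-SDR of wet vs. dry:", si_sdr(wet, dry))
print("SI-SDR of dry vs. dry:", si_sdr(dry, dry))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits comparisons between systems whose outputs differ in level.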

Paper Structure (18 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The architecture of the proposed Mel Denoiser. $N$ represents the number of layers, while $c_{bin}$, $c_{hidden}$, and $c_{emb}$ indicate the channel counts. The kernel size of the 1D convolution is denoted by $k$. Here, $c_{bin}$ matches the Mel-spectrogram bin count, and $c_{emb}$ corresponds to the embedding size. We configured $c_{hidden}$ to be four times as large as $c_{emb}$.
  • Figure 2: Mel-spectrograms of the input wet signal, the target dry signal, and the outputs of the proposed model, the HiFiGAN Denoiser [su2020hifi], DCUnet [choi2018phaseaware], and Demucs V3 [defossez2021hybrid], across seven different VST plugin effects. Our model's output more closely resembles the target signal, with stronger distortion reduction and better preservation of overtone characteristics.
  • Figure 3: Mean Opinion Scores for Audio Quality (AQ). The distribution indicates that our model primarily achieved ratings around 4 points, signifying a high level of signal quality after distortion recovery. ($\text{***} = p < .001$ in statistical test.)
  • Figure 4: Mean Opinion Scores for Dryness Level (DL). The concentration of ratings around 4 points for our model suggests that the dryness of the recovered signal compares favorably with the ground truth, demonstrating effective distortion removal. ($\text{***} = p < .001$.)
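
The channel relationships stated in the Figure 1 caption can be written down as a small configuration sketch. This is only an illustration of the stated constraint $c_{hidden} = 4 \cdot c_{emb}$ and of standard 1D convolution parameter arithmetic. The class name `MelDenoiserConfig`, the helper `conv1d_params`, and every numeric value below are hypothetical placeholders, not settings reported by the paper, and the layer layout of the actual model is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelDenoiserConfig:
    """Hyperparameters named in the Figure 1 caption (values are placeholders)."""
    c_bin: int     # number of Mel-spectrogram bins
    c_emb: int     # embedding size
    k: int         # kernel size of the 1D convolution
    n_layers: int  # N, the number of layers

    @property
    def c_hidden(self) -> int:
        # The caption states that c_hidden is four times c_emb.
        return 4 * self.c_emb


def conv1d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Weight count of a plain 1D convolution: c_in * c_out * k (+ c_out bias)."""
    return c_in * c_out * k + (c_out if bias else 0)


# Hypothetical values chosen only to make the arithmetic concrete.
cfg = MelDenoiserConfig(c_bin=80, c_emb=256, k=3, n_layers=4)
print("c_hidden:", cfg.c_hidden)
print("params of a c_bin -> c_emb conv:", conv1d_params(cfg.c_bin, cfg.c_emb, cfg.k))
```

Deriving `c_hidden` from `c_emb` rather than storing it separately keeps the two values from drifting apart when the embedding size is changed.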