Auto-adaptive Resonance Equalization using Dilated Residual Networks

Maarten Grachten, Emmanuel Deruty, Alexandre Tanguy

TL;DR

The paper tackles automatic resonance attenuation in audio production with a two-part system: a windowed dynamic equalizer that attenuates resonances, and a neural predictor that selects the attenuation factor from the input audio. It compares a feature-based feed-forward network (FFN) using Essentia descriptors with an end-to-end dilated residual network (DRN) operating on raw stereo audio. Both are trained on ground-truth preferences collected in a listening study with 15 engineers across 150 tracks. Evaluated with a mean squared bounds error metric, both models outperform a baseline and reach similar accuracy, which suggests that end-to-end approaches can match traditional feature-based methods for resonance attenuation. The work delivers a fully auto-adaptive resonance equalizer and motivates real-time plugin development for audio workstations.
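
This summary does not define the mean squared bounds error. The sketch below shows one plausible reading, assuming each track's ground truth is an acceptable range of attenuation factors (a lower and an upper bound), and that a prediction inside the range incurs no error. The function names and the range representation are illustrative assumptions, not taken from the paper.

```python
def squared_bounds_error(pred, lower, upper):
    """Squared distance from pred to the interval [lower, upper]; 0 inside it.

    Assumption: ground truth is a range of acceptable attenuation factors.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if pred < lower:
        return (lower - pred) ** 2
    if pred > upper:
        return (pred - upper) ** 2
    return 0.0


def mean_squared_bounds_error(preds, bounds):
    """Average the per-track squared bounds error.

    preds  -- predicted attenuation factors, one per track
    bounds -- (lower, upper) pairs, one per track (hypothetical format)
    """
    if not preds or len(preds) != len(bounds):
        raise ValueError("preds and bounds must be non-empty and equal in length")
    total = sum(
        squared_bounds_error(p, lo, hi) for p, (lo, hi) in zip(preds, bounds)
    )
    return total / len(preds)
```

Under this reading, a model is penalized only for predictions that fall outside the range listeners found acceptable, and the penalty grows quadratically with the distance to the nearest bound.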

Abstract

In music and audio production, attenuation of spectral resonances is an important step towards a technically correct result. In this paper we present a two-component system to automate the task of resonance equalization. The first component is a dynamic equalizer that automatically detects resonances and offers to attenuate them by a user-specified factor. The second component is a deep neural network that predicts the optimal attenuation factor based on the windowed audio. The network is trained and validated on empirical data gathered from an experiment in which sound engineers choose their preferred attenuation factors for a set of tracks. We test two distinct network architectures for the predictive model and find that a dilated residual network operating directly on the audio signal is on a par with a network architecture that requires a prior audio feature extraction stage. Both architectures predict human-preferred resonance attenuation factors significantly better than a baseline approach.

Paper Structure

This paper contains 17 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Resonance equalization block diagram. White and gray blocks represent data and processes, respectively. The green block depicts the single user-controlled parameter. The symbols $\odot$, $*$ and $-$ represent elementwise vector/vector multiplication, elementwise scalar/vector multiplication, and unary negation, respectively.
  • Figure 2: Rating percentiles (in steps of 10%) per subject. Darker areas correspond to more central percentile ranges, lighter areas to more peripheral ranges. The bold line in the left plot shows the median ratings.
  • Figure 3: Correlation coefficients of ratings among subjects.
  • Figure 4: Building blocks for the FFN and DRN models. Left: standard block composed of a dense linear layer followed by batch normalization and a rectified-linear layer (see Section \ref{sec:pred-based-audio}). Right: residual block (see Section \ref{sec:residual-blocks}).
  • Figure 5: FFN and DRN architectures.
  • ...and 1 more figure
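
The standard block named in the Figure 4 caption (a dense linear layer, then batch normalization, then a rectified-linear layer) can be sketched in plain Python as below. This is a minimal illustration of the layer ordering only. The batch normalization here uses the statistics of the current batch, and all names, shapes, and parameter values are assumptions rather than the authors' implementation.

```python
import math


def dense(x, weights, bias):
    """Dense linear layer: x is a batch of rows, weights has one vector per output unit."""
    return [
        [sum(w_i * x_i for w_i, x_i in zip(w, row)) + b for w, b in zip(weights, bias)]
        for row in x
    ]


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    n = len(x)
    cols = list(zip(*x))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [
            g * (v - m) / math.sqrt(var + eps) + b
            for v, m, var, g, b in zip(row, means, variances, gamma, beta)
        ]
        for row in x
    ]


def relu(x):
    """Rectified-linear activation applied elementwise."""
    return [[max(0.0, v) for v in row] for row in x]


def standard_block(x, weights, bias, gamma, beta):
    """Dense -> batch normalization -> ReLU, in the order given in the caption."""
    return relu(batch_norm(dense(x, weights, bias), gamma, beta))
```

A practical implementation would normally use a deep-learning framework, which also tracks running statistics so that batch normalization behaves consistently at inference time.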