ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time Optimization

Christian J. Steinmetz, Shubhr Singh, Marco Comunità, Ilias Ibnyahya, Shanxin Yuan, Emmanouil Benetos, Joshua D. Reiss

TL;DR

ST-ITO (Style Transfer with Inference-Time Optimization) searches the parameter space of an audio effect chain at inference time, rather than training a network to estimate the parameters. This enables control of arbitrary audio effect chains, including unseen and non-differentiable effects.

Abstract

Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects.
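The core idea in the abstract, searching effect parameters at inference to minimize a style distance to the reference, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `apply_gain_clip` is a hypothetical non-differentiable effect, `style_features` is a hand-written stand-in for the learned style metric, and plain random search stands in for whichever gradient-free optimizer the paper uses. For simplicity the reference is a processed copy of the input, whereas in practice it would be a different recording.

```python
import math
import random

def apply_gain_clip(signal, params):
    """Hypothetical non-differentiable effect: gain followed by hard clipping."""
    gain, threshold = params
    return [max(-threshold, min(threshold, gain * s)) for s in signal]

def style_features(signal):
    """Stand-in for a learned style embedding (assumption): RMS and peak level."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    peak = max(abs(s) for s in signal)
    return (rms, peak)

def style_distance(a, b):
    """Euclidean distance between two style feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def random_search(input_signal, reference, bounds, n_iters=2000, seed=0):
    """Gradient-free search over effect parameters, run entirely at inference."""
    rng = random.Random(seed)
    target = style_features(reference)
    best_params, best_dist = None, float("inf")
    for _ in range(n_iters):
        # Draw a candidate parameter vector uniformly within the bounds.
        params = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        processed = apply_gain_clip(input_signal, params)
        dist = style_distance(style_features(processed), target)
        if dist < best_dist:
            best_params, best_dist = params, dist
    return best_params, best_dist

# Toy signals: a sine wave as input, and a processed copy as the style reference.
input_signal = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
reference = apply_gain_clip(input_signal, (2.0, 0.8))
bounds = [(0.1, 4.0), (0.1, 1.0)]  # (gain range, clip threshold range)

best_params, best_dist = random_search(input_signal, reference, bounds)
baseline = style_distance(style_features(input_signal), style_features(reference))
```

Because the search only needs to evaluate the effect and the metric, the effect can be treated as a black box; swapping in a different effect chain would only change `apply_gain_clip` and `bounds`.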

Paper Structure

This paper contains 14 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Style transfer with Inference-Time Optimization enables audio production style transfer through control of arbitrary audio effects. It employs a pretrained audio representation as a similarity metric, which is then optimized by searching the control parameter space of audio effects.
  • Figure 2: Self-supervised training for the pretext task where an audio signal ${\bm{x}}_i^d \sim \mathcal{D}_d$ is sampled randomly from one of $N$ datasets and then processed by a randomly sampled audio effect $E_m$ with an associated randomly sampled parameter preset $P_{m,l}$ to produce an output signal ${\bm{x}}_o^d$.
  • Figure 3: Representation learning via the pretext task where the input ${\bm{x}}_i^d$ and output ${\bm{x}}_o^d$ are processed by the encoder $g(\cdot)$ to produce embeddings. These embeddings are fed to a pair of MLP classifiers trained via cross-entropy that predict the effect class and preset class.
  • Figure 4: Accuracy for the style retrieval task using different audio representations across multiple source types with varying number of audio effects ($N$) and retrieval set size.
  • Figure 5: AFx-Rep similarity in real-world style transfer.
  • ...and 2 more figures
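
The data sampling described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the two effects, their preset values, and the choice to label presets by their index within each effect are all hypothetical, and the encoder and MLP classifiers are omitted.

```python
import random

# Hypothetical effects E_m, each with a small list of parameter presets P_{m,l}.
def gain(signal, amount):
    return [amount * s for s in signal]

def hard_clip(signal, threshold):
    return [max(-threshold, min(threshold, s)) for s in signal]

EFFECTS = [
    (gain, [0.25, 0.5, 2.0]),
    (hard_clip, [0.1, 0.3, 0.6]),
]

def sample_pretext_example(datasets, rng):
    """Return (x_i, x_o, effect_label, preset_label) for one training example."""
    dataset = rng.choice(datasets)      # pick one of the N datasets
    x_i = rng.choice(dataset)           # input signal sampled from that dataset
    m = rng.randrange(len(EFFECTS))     # effect class label
    effect, presets = EFFECTS[m]
    l = rng.randrange(len(presets))     # preset class label (per-effect index, an assumption)
    x_o = effect(x_i, presets[l])       # processed output signal
    return x_i, x_o, m, l

rng = random.Random(0)
datasets = [
    [[0.1, -0.4, 0.9], [0.5, 0.5, -0.5]],
    [[-0.8, 0.2, 0.0]],
]
x_i, x_o, m, l = sample_pretext_example(datasets, rng)
```

In the setup the captions describe, the pair `(x_i, x_o)` would be passed through the encoder, and the labels `m` and `l` would serve as cross-entropy targets for the effect and preset classifiers.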