
Neural Style Transfer for Audio Spectograms

Prateek Verma, Julius O. Smith

TL;DR

The paper generates novel audio by transferring style between sounds, adapting the neural style transfer framework from images. Starting from a random-noise spectrogram, it iteratively optimizes the input so that its feature activations in a pre-trained CNN match the targets, minimizing $L_{\text{total}}=\alpha L_c+\beta L_s+\gamma L_e+\delta L_t$. The network is an AlexNet variant with $3\times 3$ receptive fields trained on spectrograms of instrument sounds. Temporal and spectral energy-envelope losses are added to the content and style terms to preserve dynamics, and audio is reconstructed from the optimized spectrogram via Griffin-Lim. The paper demonstrates two cross-synthesis tasks (bandwidth compression and expansion, and timbral transfer) with a single parameter setting, suggesting a flexible route to controllable audio synthesis that avoids task-specific, hand-tuned signal processing pipelines.
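The weighted objective above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the `features` callable is a hypothetical stand-in for the pre-trained CNN, the style term uses the Gram-matrix form from image style transfer, and the exact definitions of $L_e$ and $L_t$ are not given in this summary. Which envelope each symbol denotes, and the choice of the style spectrogram as their reference, are assumptions made here.

```python
import numpy as np

def gram(feats):
    """Channel-by-channel correlations of a (channels, positions) array."""
    return feats @ feats.T / feats.shape[1]

def mse(a, b):
    """Mean squared difference between two arrays of equal shape."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def total_loss(x, content, style, features,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weighted sum alpha*L_c + beta*L_s + gamma*L_e + delta*L_t.

    x, content, style: spectrograms of shape (freq_bins, frames).
    features: callable mapping a spectrogram to (channels, positions)
              activations; a hypothetical stand-in for the pre-trained CNN.
    """
    fx, fc, fs = features(x), features(content), features(style)
    l_c = mse(fx, fc)                # content: match activations directly
    l_s = mse(gram(fx), gram(fs))    # style: match Gram matrices
    # ASSUMPTION: L_e compares per-bin energy (spectral envelope) and
    # L_t compares per-frame energy (temporal envelope), both taken
    # against the style spectrogram.
    l_e = mse(x.sum(axis=1), style.sum(axis=1))
    l_t = mse(x.sum(axis=0), style.sum(axis=0))
    return alpha * l_c + beta * l_s + gamma * l_e + delta * l_t

# Toy usage with an identity "network" and random magnitude spectrograms.
rng = np.random.default_rng(0)
identity = lambda spec: spec
content = np.abs(rng.normal(size=(8, 20)))
style = np.abs(rng.normal(size=(8, 20)))
noise = np.abs(rng.normal(size=(8, 20)))
print(total_loss(noise, content, style, identity))
```

Setting a weight to zero switches the corresponding term off, which makes it easy to inspect how each term contributes for a given pair of sounds.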

Abstract

There has been fascinating work on creating artistic transformations of images by Gatys. This was revolutionary in showing how we can, in some sense, alter the 'style' of an image while generally preserving its 'content'. In our work, we present a method for creating new sounds using a similar approach. We treat sound creation as a style-transfer problem: starting from a random-noise input signal, we iteratively use back-propagation to optimize the sound so that it conforms to filter outputs from a pre-trained neural architecture of interest. For demonstration, we investigate two different tasks, resulting in bandwidth expansion/compression and timbral transfer from singing voice to musical instruments. A feature of our method is that a single architecture can generate these different audio style-transfer types using the same set of parameters, which would otherwise require different, complex, hand-tuned signal processing pipelines.
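The iterative procedure described in the abstract (start from random noise, then repeatedly follow the gradient of the loss) can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's pipeline: the "network" is the identity map, only content and style terms are used, and hand-derived gradients with plain gradient descent stand in for back-propagation through a CNN. The function names, learning rate, and step count are all hypothetical.

```python
import numpy as np

def gram(x):
    """Gram matrix of a (channels, positions) array."""
    return x @ x.T / x.shape[1]

def loss_and_grad(x, content, style_gram, alpha, beta):
    """Content + style loss and its analytic gradient with respect to x."""
    c, n = x.shape
    diff_c = x - content             # content residual
    diff_s = gram(x) - style_gram    # Gram-matrix residual (symmetric)
    loss = alpha * np.mean(diff_c ** 2) + beta * np.mean(diff_s ** 2)
    grad = (alpha * 2.0 * diff_c / (c * n)
            + beta * 4.0 * (diff_s @ x) / (c * c * n))
    return float(loss), grad

def stylize(content, style, alpha=1.0, beta=1.0, lr=0.5, steps=300, seed=0):
    """Optimize a random-noise array toward the content and style targets."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=content.shape)   # Gaussian-noise initialization
    style_gram = gram(style)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(x, content, style_gram, alpha, beta)
        history.append(loss)
        x = x - lr * grad                # plain gradient-descent update
    return x, history

# Toy usage: random arrays play the role of content and style activations.
rng = np.random.default_rng(1)
content = rng.normal(size=(4, 16))
style = rng.normal(size=(4, 16))
x_opt, history = stylize(content, style)
print(history[0], history[-1])
```

In a real setting the gradient would come from an automatic-differentiation framework applied to the CNN, and the optimized spectrogram would still need phase reconstruction before it can be played back.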

Paper Structure

This paper contains 4 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: (a) Gaussian noise used to initialize the input being optimized; (b) harp sound (content); (c) tuning fork (style); (d) neural style-transferred output with the content of the harp and the style of the tuning fork. https://youtu.be/UlwBsEigcdE
  • Figure 2: (a) Gaussian noise used to initialize the input being optimized; (b) singing sound (content); (c) violin note (style); (d) neural style-transferred output with the content of the singing and the style of the violin. https://youtu.be/RpGBkfs24uc