NoiseBandNet: Controllable Time-Varying Neural Synthesis of Sound Effects Using Filterbanks

Adrián Barahona-Ríos, Tom Collins

TL;DR

This paper proposes NoiseBandNet, an architecture that synthesises and controls sound effects by filtering white noise through a filterbank, going further than previous systems that make assumptions about the harmonic nature of sounds.

Abstract

Controllable neural audio synthesis of sound effects is a challenging task due to the potential scarcity and spectro-temporal variance of the data. Differentiable digital signal processing (DDSP) synthesisers have been successfully employed to model and control musical and harmonic signals using relatively limited data and computational resources. Here we propose NoiseBandNet, an architecture capable of synthesising and controlling sound effects by filtering white noise through a filterbank, thus going further than previous systems that make assumptions about the harmonic nature of sounds. We evaluate our approach via a series of experiments, modelling footsteps, thunderstorm, pottery, knocking, and metal sound effects. Comparing NoiseBandNet audio reconstruction capabilities to four variants of the DDSP-filtered noise synthesiser, NoiseBandNet scores higher in nine out of ten evaluation categories, establishing a flexible DDSP method for generating time-varying, inharmonic sound effects of arbitrary length with both good time and frequency resolution. Finally, we introduce some potential creative uses of NoiseBandNet, by generating variations, performing loudness transfer, and by training it on user-defined control curves.

Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Reconstruction task comparison between the DDSP time-varying FIR noise synthesiser and NoiseBandNet. The top row shows the waveform of the entire sound, the middle row its log-magnitude spectrogram, and the bottom row a detail of the transient. The transient location is marked with a vertical dashed line in the first and third rows. The left column shows the original training sample: a short metal impact. The middle columns show the reconstructions from five different configurations of the DDSP time-varying FIR noise synthesiser with 128, 512, 1024, 4096 and 8192 taps respectively, all of them with a hop size of 32 samples. Observe the trade-off between time and frequency resolution: frequency resolution increases with the number of taps while time resolution decreases, and vice versa. The right column shows the NoiseBandNet reconstruction using 2048 filters and a synthesis window size of 32 samples, maintaining both good time and frequency resolution.
  • Figure 2: Detail of the frequency response of some of the filters employed in a 2048-filter filterbank. Each of the filters is represented by a different colour.
  • Figure 3: Loopable noise bands. Two instances of the same noise band are concatenated along their $x$-axis. The top figure shows the waveform of both noise bands, one after the other, each one with a distinctive colour. The bottom figure shows the detail of the point where the end of the first noise band instance meets the start of the second one. Notice how, thanks to circular convolution, the start and the end of the segments are "joined up".
  • Figure 4: Overview of the NoiseBandNet architecture and training process. In this case, loudness and spectral centroid features are extracted from the training audio and passed to the network, which predicts an $M$-band matrix of time-varying amplitudes at a rate of $\text{Fs}/W$, i.e. the sampling rate $\text{Fs}$ divided by the synthesis window size $W$. Depending on the control scheme, these features or control parameters may be different (e.g., only loudness or user-provided control parameters). The predicted amplitudes are upsampled using linear interpolation by a factor of $W$ to match the audio length, and multiplied by the $M$ noise bands. The output audio is generated by summing all the bands together. Finally, the model is optimised by comparing the resulting sound against the target audio using a multi-resolution STFT (MRSTFT) loss.
  • Figure 5: Log-magnitude spectrograms from the result of the different randomisation schemes. The left column represents a non-randomised (just reconstructed) sound: a metal impact. The second and third columns show two examples of the resulting randomised sound. In the first row we employ the top $k$ randomisation scheme using $L_{\text{frame}}=430$ (3 frames), $k=100$ and a randomised multiplier in a $[0.0, 1.0]$ range. The second row depicts the frequency shift randomisation scheme with $L_{\text{frame}}=645$ (2 frames), $f_{\text{init}}=30$ and $f_{\text{shift}}=3$. The third row shows both randomisation schemes combined, using $L_{\text{frame}}=645$ (2 frames), $k=100$, a $[0.0, 1.0]$ multiplier, $f_{\text{init}}=30$ and $f_{\text{shift}}=3$.
  • ...and 3 more figures
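
The synthesis path described in the captions of Figures 3 and 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`make_noise_bands`, `upsample_linear`, `synthesise`), the brick-wall linearly spaced bands, and the small values chosen for $M$, $W$ and the band length are all assumptions made for illustration, and random amplitudes stand in for the network's predictions. The sketch builds the noise bands by masking FFT bins of a single white-noise segment, which amounts to a circular convolution and therefore yields bands that loop seamlessly. It then upsamples the control-rate amplitudes by a factor of $W$ with linear interpolation, multiplies them by the bands, and sums over bands.

```python
import numpy as np


def make_noise_bands(num_bands, length, rng):
    """Split one white-noise segment into `num_bands` band-limited signals.

    Masking FFT bins is a circular convolution, so each band loops
    seamlessly. Brick-wall, linearly spaced bands are an illustrative
    assumption, not the paper's filter design.
    """
    noise = rng.uniform(-1.0, 1.0, length)
    spectrum = np.fft.rfft(noise)
    edges = np.linspace(0, spectrum.size, num_bands + 1).astype(int)
    bands = np.zeros((num_bands, length))
    for m in range(num_bands):
        masked = np.zeros_like(spectrum)
        masked[edges[m]:edges[m + 1]] = spectrum[edges[m]:edges[m + 1]]
        bands[m] = np.fft.irfft(masked, n=length)
    return bands


def upsample_linear(amps, window):
    """Linearly interpolate (M, T) control-rate amplitudes to (M, T * window)."""
    num_frames = amps.shape[1]
    frame_pos = np.arange(num_frames) * window
    sample_pos = np.arange(num_frames * window)
    return np.stack([np.interp(sample_pos, frame_pos, row) for row in amps])


def synthesise(amps, bands, window):
    """Scale each noise band by its upsampled amplitude and sum the bands."""
    gains = upsample_linear(amps, window)
    num_samples = gains.shape[1]
    # Loop the bands when the requested output is longer than one band.
    idx = np.arange(num_samples) % bands.shape[1]
    return np.sum(gains * bands[:, idx], axis=0)


rng = np.random.default_rng(0)
M, W, T = 16, 32, 100                  # bands, synthesis window, control frames
bands = make_noise_bands(M, 1024, rng)
amps = rng.uniform(0.0, 1.0, (M, T))   # stand-in for network predictions
audio = synthesise(amps, bands, W)     # T * W output samples
```

Because the bands are periodic, the output length is set only by the number of control frames, which is one way to read the abstract's claim of generating sounds of arbitrary length. In training, `audio` would be compared against the target with the MRSTFT loss, which this sketch omits.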