Instabilities in Convnets for Raw Audio

Daniel Haider, Vincent Lostanlen, Martin Ehler, Peter Balazs

TL;DR

This work analyzes why waveform-based convnets with random Gaussian initialization can be unstable for audio, by developing large-deviation theory for the energy response of FIR filterbanks. By modeling the filterbank as a random matrix with $J$ Gaussian filters of length $T$ and setting $\sigma^2=(JT)^{-1}$, the authors show $\mathbb{E}[\|\Phi x\|^2] = \|x\|^2$ while the variance $\mathbb{V}[\|\Phi x\|^2]$ depends on input autocorrelation $R_{xx}$, making energy fluctuations larger for locally periodic signals. They connect this to frame theory, deriving frame bounds $A,B$ and showing $\|\Phi x\|^2$ can be written in terms of chi-squared variables; they further approximate the extreme bounds and propose a scaling law $\tilde{\kappa}(J,T)$ that suggests keeping $J \propto \log T$ to stabilize conditioning, a behavior reminiscent of discrete wavelet bases. The results imply practical regularization strategies and initialization considerations to mitigate instabilities in audio convnets, highlighting that many short filters tend to yield better numerical stability than few long filters. Overall, the paper provides a rigorous link between random filterbank initialization, input autocorrelation, and stability, with implications for designing learnable 1-D convnets for audio tasks.
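The two energy claims above can be checked numerically. The following is a minimal Monte Carlo sketch, not the authors' code: it assumes circular convolution with unit stride, and the helper names (`filterbank_energy`, `energy_stats`), the test signals, and the values $J=10$, $T=64$, $N=1024$ are illustrative choices.

```python
import numpy as np

def filterbank_energy(x, J, T, rng):
    """Return ||Phi x||^2 for one random filterbank of J Gaussian filters of length T."""
    N = len(x)
    sigma = 1.0 / np.sqrt(J * T)             # sigma^2 = (JT)^{-1}
    w = rng.normal(0.0, sigma, size=(J, T))  # i.i.d. Gaussian filter taps
    # Assumption: circular convolution, computed with the FFT
    # (each filter is zero-padded from length T to length N).
    X = np.fft.rfft(x)
    W = np.fft.rfft(w, n=N, axis=1)
    y = np.fft.irfft(W * X, n=N, axis=1)
    return float(np.sum(y ** 2))

def energy_stats(x, J=10, T=64, trials=1000, seed=0):
    """Empirical mean and variance of ||Phi x||^2 across random initializations."""
    rng = np.random.default_rng(seed)
    e = np.array([filterbank_energy(x, J, T, rng) for _ in range(trials)])
    return e.mean(), e.var()

N = 1024
rng = np.random.default_rng(1)

# Weakly autocorrelated input: unit-norm white noise.
noise = rng.standard_normal(N)
noise /= np.linalg.norm(noise)

# Strongly autocorrelated input: unit-norm sinusoid.
sine = np.sin(2 * np.pi * 8 * np.arange(N) / N)
sine /= np.linalg.norm(sine)

mean_noise, var_noise = energy_stats(noise)
mean_sine, var_sine = energy_stats(sine)
print(f"noise: mean={mean_noise:.3f}, var={var_noise:.4f}")
print(f"sine:  mean={mean_sine:.3f}, var={var_sine:.4f}")
```

Under these assumptions, both empirical means should sit close to $\|x\|^2 = 1$, while the sinusoid should show a markedly larger variance than the noise, in line with the autocorrelation argument.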

Abstract

What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal approximations. In our article, we approach this phenomenon from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, which are both typical for audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law between the number and length of the filters, which is reminiscent of discrete wavelet bases.

Paper Structure (10 sections, 8 theorems, 33 equations, 4 figures)

Key Result

Proposition 2.1

Let $x\in \mathbb{R}^N$ and let $\Phi$ be a random filterbank with $J$ i.i.d. filters $w_{\!j}\sim \mathcal{N}(0,\sigma^2 I)$ of length $T\leq N$. Then the expectation and variance of $\Vert\Phi x\Vert^2$ satisfy
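A short derivation makes the dependence on the autocorrelation concrete. It is a sketch under assumptions that this summary does not state: circular convolution with unit stride, and the circular autocorrelation $R_{xx}(\tau)=\sum_{n} x[n]\,x[(n+\tau) \bmod N]$. It should be read as a reconstruction, not as a quotation of the authors' formula. For a single filter $w$, the output energy is a quadratic form in the filter taps, with $Q\in\mathbb{R}^{T\times T}$ and $Q[t,t'] = R_{xx}(t-t')$:

$$
\Vert x \ast w\Vert^2 = \sum_{t,t'=0}^{T-1} w[t]\,w[t']\,R_{xx}(t-t') = w^\top Q\, w .
$$

For $w\sim\mathcal{N}(0,\sigma^2 I)$, the standard moments of a Gaussian quadratic form give $\mathbb{E}[w^\top Q w]=\sigma^2\,\mathrm{tr}(Q)$ and $\mathbb{V}[w^\top Q w]=2\sigma^4\Vert Q\Vert_F^2$. Using $R_{xx}(0)=\Vert x\Vert^2$ and summing over the $J$ independent filters:

$$
\mathbb{E}\big[\Vert\Phi x\Vert^2\big] = JT\sigma^2\,\Vert x\Vert^2,
\qquad
\mathbb{V}\big[\Vert\Phi x\Vert^2\big] = 2J\sigma^4\Big(T\,\Vert x\Vert^4 + 2\sum_{\tau=1}^{T-1}(T-\tau)\,R_{xx}(\tau)^2\Big).
$$

With $\sigma^2=(JT)^{-1}$ the expectation reduces to $\Vert x\Vert^2$, consistent with the TL;DR, and the variance grows with the squared autocorrelation at lags below $T$.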

Figures (4)

  • Figure 1: Autocorrelation in the input signal $x$ increases the variance of the filterbank response energy $\Vert \Phi x \Vert^2$ across random initializations. We compare audio signals with different autocorrelation profiles. Left to right: Snare (low), speech (medium), and flute (high). Top: Spectrograms of the signals. Bottom: Empirical histogram of $\Vert \Phi x \Vert^2$ for 1000 independent realizations of $\Phi$.
  • Figure 2: Large deviations of filterbank response energy ($\Vert \Phi x \Vert^2 -\Vert x \Vert^2$) for three synthetic signals of length $N=1024$ (top) and three natural signals of length $N=22050$ (bottom). Blue: empirical mean and $95^{\textrm{th}}$ percentile across $1000$ realizations of $\Phi$. We show two theoretical bounds: Cantelli (Proposition 2.3, orange) and Chernoff (Proposition 2.4, green). Each filterbank contains $J=10$ filters of length $T=2^k$ where $3\leq k\leq10$.
  • Figure 3: Empirical means $\overline{A}$ and $\overline{B}$ (solid lines) and $95^{\mathrm{th}}$ percentiles (shaded area) of frame bounds $A$ and $B$ for $1000$ instances of $\Phi$ with $\sigma^2=(TJ)^{-1}$, $J=40$ and different values of $T$. Dashed lines denote the bounds of $\mathbb{E}[A]$ and $\mathbb{E}[B]$ from the theorem on expected frame bounds. Dotted lines denote the asymptotic bounds used to define $\tilde{\kappa}$.
  • Figure 4: We denote by $\overline{A},\overline{B}$, and $\overline{\kappa}$ the empirical means of the respective quantities over $1000$ instances of $\Phi$ with $\sigma^2=(TJ)^{-1}$. Top: Comparison of $\overline{\kappa}$ (solid) and $\overline{B}/\overline{A}$ (dashed) for increasing filter length $T$ and different values of $J$. Bottom: Empirical mean $\overline{\kappa}$ for increasing numbers of filters $J$ and different values $T$. For $J=\log_2 T$ (solid black), $\overline{\kappa}$ remains approximately constant.
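The frame-bound experiments described in Figures 3 and 4 can be approximated with a few lines of NumPy. This is a sketch under stated assumptions, not the authors' implementation: it assumes circular convolution with unit stride, in which case $\Phi^\top\Phi$ is circulant and its eigenvalues are $\sum_j |\hat{w}_j[k]|^2$, where $\hat{w}_j$ is the $N$-point DFT of the zero-padded filter. The function names and the values of $N$, $J$, $T$, and the number of trials are illustrative.

```python
import numpy as np

def frame_bounds(w, N):
    """Frame bounds (A, B) of a unit-stride circular filterbank with filters w (J x T)."""
    W = np.fft.fft(w, n=N, axis=1)         # N-point DFT of the zero-padded filters
    lp = np.sum(np.abs(W) ** 2, axis=0)    # eigenvalues of Phi^T Phi, one per frequency bin
    return float(lp.min()), float(lp.max())

def mean_condition_number(J, T, N=1024, trials=200, seed=0):
    """Empirical mean of kappa = B / A over random Gaussian initializations."""
    rng = np.random.default_rng(seed)
    sigma = 1.0 / np.sqrt(J * T)           # sigma^2 = (TJ)^{-1}
    kappas = []
    for _ in range(trials):
        w = rng.normal(0.0, sigma, size=(J, T))
        A, B = frame_bounds(w, N)
        kappas.append(B / A)
    return float(np.mean(kappas))

# Many short filters versus few long filters (illustrative values).
kappa_short = mean_condition_number(J=40, T=8)
kappa_long = mean_condition_number(J=4, T=256)
print(f"J=40, T=8:   mean kappa = {kappa_short:.2f}")
print(f"J=4,  T=256: mean kappa = {kappa_long:.2f}")
```

In this setting the many-short-filters configuration should yield the smaller mean condition number, matching the qualitative claim in the TL;DR. Sweeping $T$ with $J=\log_2 T$ would be the natural way to probe the scaling behaviour reported in Figure 4.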

Theorems & Definitions (15)

  • Proposition 2.1
  • Lemma 2.2
  • proof: Proof of Proposition 2.1
  • Proposition 2.3: Cantelli bound
  • Proposition 2.4: Chernoff bound
  • Theorem 3.1
  • proof
  • Proposition 3.2
  • Proposition 5.1
  • proof
  • ...and 5 more