SCRAPL: Scattering Transform with Random Paths for Machine Learning

Christopher Mitcheltree, Vincent Lostanlen, Emmanouil Benetos, Mathieu Lagrange

TL;DR

SCRAPL tackles the prohibitive cost of differentiable multivariable scattering losses by sampling random scattering transform (ST) paths and stabilizing the gradient with path-wise optimizers ($\mathcal{P}$-Adam, $\mathcal{P}$-SAGA) and a perceptually informed $\theta$-importance sampling initialization. The method is evaluated on JTFS-based unsupervised sound matching across granular, chirplet, and TR-808 synthesis tasks, where it reaches accuracy close to full JTFS at runtimes near those of the multiscale spectral (MSS) loss. A key theoretical result is that uniformly sampled path gradients provide an unbiased estimate of the full ST gradient; the practical gains come from combining per-path momentum strategies with biased path sampling. The work enables scalable, differentiable audio optimization and points to future extensions to other ST families and multimodal inverse problems.
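For orientation, the classic SAGA update that a path-wise variant such as $\mathcal{P}$-SAGA presumably builds on can be sketched on a toy problem. Everything below is an assumption made for illustration: the per-path "losses" are hypothetical least-squares terms rather than scattering paths, the names `A`, `b`, `grad_p`, and `table` are invented here, and the paper's actual $\mathcal{P}$-SAGA may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, D = 8, 4, 6                      # paths, outputs per path, parameters
A = rng.standard_normal((P, K, D))     # hypothetical per-path linear models
b = rng.standard_normal((P, K))        # hypothetical per-path targets

def grad_p(p, w):
    """Gradient of the per-path loss 0.5 * ||A[p] @ w - b[p]||^2."""
    return A[p].T @ (A[p] @ w - b[p])

# Reference minimizer of the average loss, from stacked least squares.
w_star = np.linalg.lstsq(A.reshape(P * K, D), b.reshape(P * K), rcond=None)[0]

w = np.zeros(D)
table = np.stack([grad_p(p, w) for p in range(P)])  # last gradient seen per path
avg = table.mean(axis=0)                            # running mean of the table

# Conservative step size 1 / (3 L), with L the largest per-path curvature.
L = max(np.linalg.norm(A[p], 2) ** 2 for p in range(P))
lr = 1.0 / (3.0 * L)

for _ in range(5000):
    p = rng.integers(P)                  # uniform path sampling
    g_new = grad_p(p, w)
    w -= lr * (g_new - table[p] + avg)   # SAGA variance-reduced direction
    avg += (g_new - table[p]) / P        # keep the mean in sync with the table
    table[p] = g_new

err = np.linalg.norm(w - w_star)         # distance to the full-loss minimizer
```

Each step evaluates a single path, yet the stored per-path gradients let the update approximate the full gradient, which is why the iterate can settle at the minimizer of the average loss instead of jittering around it as plain single-path SGD would.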

Abstract

The Euclidean distance between wavelet scattering transform coefficients (known as paths) provides informative gradients for perceptual quality assessment of deep inverse problems in computer vision, speech, and audio processing. However, these transforms are computationally expensive when employed as differentiable loss functions for stochastic gradient descent due to their numerous paths, which significantly limits their use in neural network training. To address this problem, we propose "Scattering transform with Random Paths for machine Learning" (SCRAPL): a stochastic optimization scheme for efficient evaluation of multivariable scattering transforms. We implement SCRAPL for the joint time-frequency scattering transform (JTFS), which demodulates spectrotemporal patterns at multiple scales and rates, allowing a fine characterization of intermittent auditory textures. We apply SCRAPL to differentiable digital signal processing (DDSP), specifically to unsupervised sound matching of a granular synthesizer and the Roland TR-808 drum machine. We also propose an initialization heuristic based on importance sampling, which adapts SCRAPL to the perceptual content of the dataset, improving neural network convergence and evaluation performance. We make our code and audio samples available and provide SCRAPL as a Python package.
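The core idea of evaluating one random path per step instead of all of them can be illustrated with a small numerical sketch. It does not use the SCRAPL package or a real scattering transform: the "paths" are hypothetical fixed linear maps `W[p]`, and the loss is assumed to be the average of per-path squared Euclidean distances. Under those assumptions, averaging the gradients of uniformly sampled paths approaches the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D, K = 8, 16, 4          # paths, signal length, coefficients per path
# Hypothetical stand-in for scattering paths: one fixed linear map per path.
W = rng.standard_normal((P, K, D))

def path_grad(p, x_hat, x):
    """Gradient w.r.t. x_hat of ||W[p] @ x_hat - W[p] @ x||^2."""
    r = W[p] @ (x_hat - x)
    return 2.0 * W[p].T @ r

def full_grad(x_hat, x):
    """Gradient of the loss averaged over all P paths."""
    return np.mean([path_grad(p, x_hat, x) for p in range(P)], axis=0)

x = rng.standard_normal(D)      # target signal
x_hat = rng.standard_normal(D)  # current reconstruction

g_full = full_grad(x_hat, x)

# One uniformly sampled path per step: the cost of a single path instead of P.
n_steps = 20000
samples = rng.integers(0, P, size=n_steps)
g_mc = np.mean([path_grad(p, x_hat, x) for p in samples], axis=0)

# Relative gap between the Monte Carlo average and the full gradient.
rel_err = np.linalg.norm(g_mc - g_full) / np.linalg.norm(g_full)
```

The gap `rel_err` shrinks as `n_steps` grows, which is the behavior an unbiased estimator should show; any single step remains noisy, which is what motivates the stabilizing optimizers.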

Paper Structure

This paper contains 22 sections, 1 theorem, 17 equations, 6 figures, 12 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $\Phi = (\phi_{p})_{p=0}^{P-1}$ be a scattering transform with $P$ paths. Given a signal or image $\boldsymbol{x}$, let $F_{\boldsymbol{x}}$ be an autoencoder operating on $\boldsymbol{x}$ and let $\mathcal{L}_{\boldsymbol{x}}^{\Phi}$ be the associated ST reconstruction loss. Let $\mathcal{U}_P$ b
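Why uniform path sampling can yield an unbiased gradient can be sketched in one line, under assumptions introduced here for illustration only: the ST loss is taken to decompose as a sum of per-path terms, $\mathcal{L}_{\boldsymbol{x}}^{\Phi} = \sum_{p=0}^{P-1} \mathcal{L}_{\boldsymbol{x}}^{\phi_p}$; $\boldsymbol{w}$ is a hypothetical symbol for the trainable weights of $F_{\boldsymbol{x}}$; and the path index $p$ is drawn uniformly from $\{0, \dots, P-1\}$. By linearity of the gradient and of the expectation,

$$
\mathbb{E}_{p}\!\left[\nabla_{\boldsymbol{w}} \mathcal{L}_{\boldsymbol{x}}^{\phi_p}\right]
= \sum_{p=0}^{P-1} \frac{1}{P}\, \nabla_{\boldsymbol{w}} \mathcal{L}_{\boldsymbol{x}}^{\phi_p}
= \frac{1}{P}\, \nabla_{\boldsymbol{w}} \mathcal{L}_{\boldsymbol{x}}^{\Phi}.
$$

Under these assumptions, the single-path gradient matches the full gradient in expectation up to the constant factor $1/P$, which can be absorbed into the learning rate. The paper's exact statement and normalization may differ.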

Figures (6)

  • Figure 1: Mean average synthesizer parameter error (y-axis) versus computational cost (x-axis) of unsupervised sound matching models for the granular synthesis task. Both axes are rescaled by the performance of a supervised model with the same number of parameters. Whiskers denote 95% CI, estimated over 20 random seeds. Due to computational limitations, JTFS-based sound matching is evaluated only once.
  • Figure 2: Left: JTFS vs. SCRAPL wall-clock training times on a single NVIDIA RTX A5000 GPU. Due to computational limitations, the JTFS method is only evaluated once. Right: Validation convergence graphs for the unsupervised granular synth sound matching task. Both: Shaded areas are 95% CI for 20 training runs using different random seeds.
  • Figure 3: Mean average JTFS perceptual audio distance (y-axis) versus computational cost (x-axis) of unsupervised sound matching models for the granular synthesis task. Both axes are rescaled by the performance of a supervised model with the same number of parameters. Whiskers denote 95% CI, estimated over 20 random seeds. Due to computational limitations, JTFS-based sound matching is evaluated only once.
  • Figure 4: Validation convergence graphs of SCRAPL ablations and the JTFS for the unsupervised granular synth sound matching task. Shaded areas are 95% CI for 20 training runs using different random seeds. Due to computational limitations, the JTFS method is only evaluated once.
  • Figure 5: SCRAPL $\boldsymbol{\theta_{\mathrm{synth}}} \; L_1$ validation values during training for four different AM / FM chirplet synths with two continuous $\boldsymbol{\theta_{\mathrm{synth}}}$ parameters: $\theta_{\mathrm{AM}}$ and $\theta_{\mathrm{FM}}$ (more details in the chirplet experiments subsection of the paper). Blue curves use the $\theta$-importance sampling initialization heuristic, and black curves use uniform sampling. Shaded areas are 95% CI for 20 training runs using different random seeds.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Proposition 3.1