Wavelet Networks: Scale-Translation Equivariant Learning From Raw Time-Series

David W. Romero, Erik J. Bekkers, Jakub M. Tomczak, Mark Hoogendoorn

TL;DR

This paper introduces Wavelet Networks, a time-series model family that preserves scale and translation symmetries via lifting and group convolutions on the scale-translation group. By parameterizing convolutional kernels on continuous bases (B$^2$-splines) and discretizing scales with a dyadic grid, the authors implement wavelet-like, nested time-frequency transforms across layers, yielding strong performance on raw waveforms without explicit spectrogram preprocessing. Empirical results across environmental sounds, music tagging, and bearing fault detection show that Wavelet Networks outperform standard CNNs on raw signals and match or exceed spectrogram-based methods with far fewer parameters. The work also connects the scale-translation transform to the classical wavelet transform, offering a principled, symmetry-preserving alternative to conventional spectro-temporal processing in time-series learning.

Abstract

Leveraging the symmetries inherent to specific data domains for the construction of equivariant neural networks has led to remarkable improvements in terms of data efficiency and generalization. However, most existing research focuses on symmetries arising from planar and volumetric data, leaving a crucial data source largely underexplored: time-series. In this work, we fill this gap by leveraging the symmetries inherent to time-series for the construction of equivariant neural networks. We identify two core symmetries, *scale and translation*, and construct scale-translation equivariant neural networks for time-series learning. Intriguingly, we find that scale-translation equivariant mappings share a strong resemblance with the wavelet transform. Inspired by this resemblance, we term our networks Wavelet Networks, and show that they perform nested non-linear wavelet-like time-frequency transforms. Empirical results show that Wavelet Networks outperform conventional CNNs on raw waveforms, and match strongly engineered spectrogram techniques across several tasks and time-series types, including audio, environmental sounds, and electrical signals. Our code is publicly available at https://github.com/dwromero/wavelet_networks.

Paper Structure

This paper contains 31 sections, 37 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Equivariance, invariance and their impact on the hierarchical representations. In a group equivariant mapping, when the input is transformed by a group transformation, its output undergoes an equivalent transformation (Fig. \ref{fig:equivariant_map}). In contrast, in group invariant maps, the output remains unchanged for all group transformations of the input (Fig. \ref{fig:invariant_map}). This distinction has significant implications for the construction of hierarchical feature representations. For example, a face recognition system built upon invariant eye, nose and mouth detectors would be unable to tell the portraits in Fig. \ref{fig:faces_example} apart. However, by leveraging equivariant mappings, information about the input transformations can be used to distinguish these portraits effectively. In essence, in contrast to equivariant maps, invariant maps permit senseless pattern combinations, which result from overly restrictive constraints in their design.
  • Figure 2: Locality of visual and auditory objects. Whereas visual objects are local (left), auditory objects are not. The latter often cover large parts of the frequency axis in a sparse manner (right).
  • Figure 3: Occlusion and superposition. Visual objects occlude each other when they appear simultaneously at a given position (left). Auditory objects, instead, superpose at all shared positions (right).
  • Figure 4: Scale-translation lifting and group convolution. The lifting convolution can be seen as a set of 1$\mathrm{D}$ convolutions with a bank of scaled convolutional kernels $\frac{1}{s}{\mathcal{L}}_s \psi$, and the group convolution can be seen as a set of 1$\mathrm{D}$ convolutions with a bank of scaled convolutional kernels $\frac{1}{s^2}{\mathcal{L}}_s\psi$, followed by an integral over scales $\varsigma \in {\mathbb{R}}$. Their main difference is that, for group convolutions, the input $f$ and the convolutional kernel $\psi$ are functions on the scale-translation group, whereas for lifting convolutions these are functions on ${\mathbb{R}}$. Lifting and group convolutions can be seen as spectro-temporal decompositions, with large values of $s$ relating to coarse features and small values to finer features.
  • Figure 5: Wavelet networks.
  • ...and 3 more figures
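
The lifting convolution described in the Figure 4 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`mexican_hat`, `scaled_kernel`, `lifting_conv`) are hypothetical, the kernel is a fixed Mexican-hat wavelet rather than the learned B$^2$-spline parameterization used in the paper, and the kernel support (`base_half_width`) is an arbitrary choice. It builds the bank of scaled kernels $\frac{1}{s}{\mathcal{L}}_s \psi$ on a dyadic scale grid, convolves a 1D signal with each, and then checks that translating the input translates every scale channel of the output by the same amount.

```python
import numpy as np

def mexican_hat(t):
    # Assumed mother kernel psi(t); the paper learns its kernels instead.
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def scaled_kernel(psi, s, half_width):
    # Sample (1/s) * psi(t / s) on the integer grid t = -half_width..half_width.
    t = np.arange(-half_width, half_width + 1, dtype=float)
    return psi(t / s) / s

def lifting_conv(f, psi, scales, base_half_width=4):
    # One 1D convolution per scale; the result has shape [num_scales, len(f)].
    # Assumes len(f) is at least as long as the largest sampled kernel.
    out = np.empty((len(scales), len(f)))
    for i, s in enumerate(scales):
        k = scaled_kernel(psi, s, int(np.ceil(base_half_width * s)))
        out[i] = np.convolve(f, k, mode="same")
    return out

# Dyadic scale grid: s = 1, 2, 4.
scales = [2**j for j in range(3)]

# Toy signal whose support stays away from the borders, so that a circular
# shift behaves like a plain translation.
rng = np.random.default_rng(0)
f = np.zeros(256)
f[100:140] = rng.standard_normal(40)

out = lifting_conv(f, mexican_hat, scales)

# Translation equivariance: shifting the input shifts each scale channel.
shift = 7
out_of_shifted = lifting_conv(np.roll(f, shift), mexican_hat, scales)
equivariant = np.allclose(out_of_shifted, np.roll(out, shift, axis=1))
print(out.shape, equivariant)
```

The extra scale axis in the output is what makes this a lifting: a function on ${\mathbb{R}}$ becomes a function on (discretized) scale and translation. Rows with larger $s$ use wider kernels and therefore respond to coarser structure in the signal.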