Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal

TL;DR

This paper tackles the challenge of non-linear latent spaces in high-compression audio autoencoders by introducing an unsupervised data-augmentation method that induces approximate linearity through implicit regularization. The approach encourages homogeneity and additivity in the decoder, i.e. Dec_theta(a z) ≈ a Dec_theta(z) and Dec_theta(z_u + z_v) ≈ Dec_theta(z_u) + Dec_theta(z_v), without modifying the model or loss function. Applied to the Music2Latent CAE, the method uses random latent gains and artificial mixtures within a consistency-training framework, yielding linear encoder and decoder behavior while preserving reconstruction quality. Empirically, Lin-CAE improves latent arithmetic and oracle music source separation, outperforming baselines in additivity and separation metrics, and enabling more interpretable, controllable audio manipulation in compressed space.
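
As a rough illustration of the augmentation described above, the sketch below builds (latent, target) pairs for the two properties. It is a simplified reading of the summary, not the authors' implementation: the names `make_training_pairs` and `toy_encode`, the gain range, and the averaging encoder are all assumptions made for this example, and the consistency-training loss itself is left out.

```python
import numpy as np

def toy_encode(x, hop=4):
    """Hypothetical stand-in for the CAE encoder: average every `hop` samples."""
    return x.reshape(-1, hop).mean(axis=1)

def make_training_pairs(encode, u, v, rng, gain_range=(0.25, 4.0)):
    """Return (latent, target) pairs for homogeneity and additivity.

    `gain_range` is an assumed value, not one taken from the paper.
    """
    z_u, z_v = encode(u), encode(v)

    # Homogeneity: scale the latent by a random gain; the decoder target
    # is the audio scaled by the same gain.
    a = rng.uniform(*gain_range)
    homogeneity_pair = (a * z_u, a * u)

    # Additivity: sum the latents of two sources; the decoder target
    # is their artificial mixture.
    additivity_pair = (z_u + z_v, u + v)
    return homogeneity_pair, additivity_pair

rng = np.random.default_rng(0)
u, v = rng.standard_normal(16), rng.standard_normal(16)
(z_h, x_h), (z_a, x_a) = make_training_pairs(toy_encode, u, v, rng)
print(z_h.shape, x_h.shape)  # latent is 4x shorter than the audio in this toy setup
```

Because only the inputs and targets change, a decoder trained on such pairs would be pushed toward linear behavior without any extra loss term, which is what "implicit regularization" refers to here.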

Abstract

Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
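
The two properties can be checked numerically for any decoder. The sketch below is a minimal example under stated assumptions: the relative-error measure and the two toy decoders (`linear_decode`, `tanh_decode`) are illustrative choices, not the metrics or models used in the paper.

```python
import numpy as np

def relative_error(estimate, reference):
    """L2 error of `estimate` relative to the norm of `reference`."""
    return np.linalg.norm(estimate - reference) / np.linalg.norm(reference)

def homogeneity_error(decode, z, a):
    """How far Dec(a z) is from a Dec(z)."""
    return relative_error(decode(a * z), a * decode(z))

def additivity_error(decode, z_u, z_v):
    """How far Dec(z_u + z_v) is from Dec(z_u) + Dec(z_v)."""
    return relative_error(decode(z_u + z_v), decode(z_u) + decode(z_v))

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 8))  # toy decoder weights (hypothetical)

def linear_decode(z):
    return W @ z

def tanh_decode(z):
    return np.tanh(W @ z)  # a non-linear decoder for contrast

z_u, z_v = rng.standard_normal(8), rng.standard_normal(8)
for name, dec in [("linear", linear_decode), ("tanh", tanh_decode)]:
    print(name, homogeneity_error(dec, z_u, 2.0), additivity_error(dec, z_u, z_v))
```

The linear decoder gives errors at floating-point precision, while the non-linear one gives clearly non-zero errors for both properties.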

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: In a linear decoder, applying a gain to the latent vector scales the output by the same gain (homogeneity), and summing latents corresponds to a sum in the audio domain (additivity).
  • Figure 2: (1) Music2Latent CAE architecture. The decoder is a denoising U-Net, and the latent is introduced to it at every resolution level after learned upsampling. (2) Proposed CAE training trick to implicitly enforce homogeneity in the decoder. (3) Proposed trick to enforce additivity, applied when the input is an artificial mixture. (4) Batch creation procedure with artificial mixtures of mixtures.
  • Figure 3: Oracle Music Source Separation via latent arithmetic.
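
To make the idea of separation via latent arithmetic concrete, the sketch below uses a toy, exactly linear encoder/decoder pair. Everything in it is an assumption for illustration: the matrices `E` and `D`, the signal names, and the reading of "oracle" as "the latent of the remaining sources is known". It is not the Lin-CAE model or the paper's evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((8, 32))  # toy linear encoder: 32 samples -> 8 latents
D = np.linalg.pinv(E)             # toy linear decoder: 8 latents -> 32 samples

def encode(x):
    return E @ x

def decode(z):
    return D @ z

vocals = rng.standard_normal(32)
accompaniment = rng.standard_normal(32)
mixture = vocals + accompaniment

# Separation: subtract the known accompaniment latent from the mixture latent.
vocals_est = decode(encode(mixture) - encode(accompaniment))

# Composition: sum source latents and decode them as a mixture.
mixture_est = decode(encode(vocals) + encode(accompaniment))

print(np.allclose(vocals_est, decode(encode(vocals))))
print(np.allclose(mixture_est, decode(encode(mixture))))
```

With an exactly linear pair, the separated estimate matches the autoencoder's own reconstruction of the target source, so separation quality is bounded only by reconstruction quality. A learned model that is only approximately linear would add some error on top of that.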