Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal
TL;DR
This paper tackles the challenge of non-linear latent spaces in high-compression audio autoencoders by introducing an unsupervised data-augmentation method that induces approximate linearity through implicit regularization. The approach encourages two properties of the decoder: homogeneity, Dec_theta(a z) ≈ a Dec_theta(z) for a scalar gain a, and additivity, Dec_theta(z_u + z_v) ≈ Dec_theta(z_u) + Dec_theta(z_v), without modifying the model or loss function. Applied to the Music2Latent CAE, the method uses random latent gains and artificial mixtures within a consistency-training framework, yielding approximately linear encoder and decoder behavior while preserving reconstruction quality. Empirically, the resulting model, Lin-CAE, improves latent arithmetic and oracle music source separation, outperforming baselines on additivity and separation metrics and enabling more interpretable, controllable audio manipulation in the compressed space.
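The two properties above can be checked numerically for any decoder. The following is a minimal sketch, not the paper's evaluation code: the helper names `homogeneity_error` and `additivity_error`, the relative-error definition, and the toy decoders are all assumptions made for illustration.

```python
import numpy as np

def homogeneity_error(dec, z, a):
    """Relative error between Dec(a * z) and a * Dec(z)."""
    lhs = dec(a * z)
    rhs = a * dec(z)
    return np.linalg.norm(lhs - rhs) / (np.linalg.norm(rhs) + 1e-12)

def additivity_error(dec, z_u, z_v):
    """Relative error between Dec(z_u + z_v) and Dec(z_u) + Dec(z_v)."""
    lhs = dec(z_u + z_v)
    rhs = dec(z_u) + dec(z_v)
    return np.linalg.norm(lhs - rhs) / (np.linalg.norm(rhs) + 1e-12)

# Toy stand-ins for a decoder (hypothetical, not the Music2Latent decoder).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))

def linear_dec(z):
    # Exactly linear map: both errors should be near machine precision.
    return W @ z

def nonlinear_dec(z):
    # Saturating non-linearity: both properties are violated.
    return np.tanh(W @ z)

z_u = rng.standard_normal(4)
z_v = rng.standard_normal(4)

lin_hom = homogeneity_error(linear_dec, z_u, 0.5)
lin_add = additivity_error(linear_dec, z_u, z_v)
nonlin_add = additivity_error(nonlinear_dec, z_u, z_v)
```

For an exactly linear decoder both errors vanish up to floating-point precision, while the `tanh` decoder gives a clearly non-zero additivity error. A trained model would be expected to fall somewhere between these two extremes.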
Abstract
Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology that induces linearity in a high-compression Consistency Autoencoder (CAE) through data augmentation. The method encourages homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of the learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
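To illustrate what composition and separation via latent arithmetic mean when the encoder and decoder are linear, here is a toy sketch. It assumes an exactly linear decoder matrix `D` with its pseudo-inverse as the encoder, and it reads "oracle" as having access to the latent of the other source; neither assumption comes from the paper, and the real CAE is only approximately linear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear autoencoder: decoder D maps 8-dim latents to
# 64-sample signals, and the encoder is the pseudo-inverse of D.
D = rng.standard_normal((64, 8))
E = np.linalg.pinv(D)

def encode(x):
    return E @ x

def decode(z):
    return D @ z

# Two toy "sources" that lie in the range of the decoder, so the
# autoencoder reconstructs them exactly.
x_vocals = decode(rng.standard_normal(8))
x_accomp = decode(rng.standard_normal(8))
x_mix = x_vocals + x_accomp

# Composition: adding latents corresponds to mixing signals.
z_sum = encode(x_vocals) + encode(x_accomp)
composed = decode(z_sum)

# Separation with an oracle latent: subtract the accompaniment latent
# from the mixture latent and decode the remainder.
z_residual = encode(x_mix) - encode(x_accomp)
est_vocals = decode(z_residual)
```

In this idealized setting `composed` matches `x_mix` and `est_vocals` matches `x_vocals` up to floating-point error. With a learned model the match is approximate, and its quality depends on how closely the trained encoder and decoder satisfy additivity.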
