Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders

Mathias Rose Bjare, Giorgia Cantisani, Marco Pasini, Stefan Lattner, Gerhard Widmer

TL;DR

This work addresses how to induce a perceptual hierarchy in music representations: autoencoders are trained with noise-augmented latents and perceptual losses so that coarse latent structures carry the perceptually salient information. It uses a two-stage latent diffusion framework (CAE-based encoding followed by a rectified-flow autoregressive model) and introduces fixed-latent-variance noise strategies to reinforce hierarchical alignment. Empirically, perceptually aligned latent spaces improve musical surprisal estimation and neural encoding (EEG) of music; performance is best at intermediate noise levels and depends on the bottleneck activation (LayerNorm vs. TanH). The findings suggest that aligning coarse latent structures with perceptual features benefits diffusion-based decoding tasks and could generalize to other audio-cognition applications.

Abstract

We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG-brain responses to music listening. Pretrained weights are available on github.com/CPJKU/pa-audioic.
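
As a rough illustration of "reconstructing inputs from noised versions of their encodings", the sketch below perturbs a batch of latents before they would be passed to a decoder. It is not the authors' implementation: the linear interpolation between latent and Gaussian noise, the function names, and the parameters `m` and `s` are assumptions chosen to match the rectified-flow and logit-normal terminology used on this page.

```python
import numpy as np

def sample_logit_normal(rng, size, m=0.0, s=1.0):
    """Draw t in (0, 1) as sigmoid(m + s * n) with n ~ N(0, 1)."""
    n = rng.standard_normal(size)
    return 1.0 / (1.0 + np.exp(-(m + s * n)))

def noise_latents(z, rng, m=0.0, s=1.0):
    """Interpolate each latent toward Gaussian noise (assumed scheme).

    z has shape (batch, dim). One t is drawn per example:
    t close to 1 keeps the latent, t close to 0 is almost pure noise,
    so a lower mean m yields more noise on average.
    """
    t = sample_logit_normal(rng, (z.shape[0], 1), m, s)
    eps = rng.standard_normal(z.shape)
    return t * z + (1.0 - t) * eps, t

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))      # stand-in for encoder outputs
z_noisy, t = noise_latents(z, rng, m=-1.0)
# In training, a decoder would reconstruct the input from z_noisy
# under a perceptual loss; that part is omitted here.
```

Under this assumed scheme, fine latent detail is the first thing drowned out by noise, so a decoder trained this way can only rely on coarse latent structure at high noise levels.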

Paper Structure

This paper contains 12 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Perceptual quality metrics for reconstructions of aligned latents $NT=E,D$ and unaligned latents $NT=\emptyset$ and $NT=D$.
  • Figure 2: Cortical tracking of IC computed with aligned and unaligned latents across different noise levels. $\Delta r$ denotes the increase in prediction accuracy when comparing a full model (IC + acoustic envelope) with a reduced model including only the envelope. Bar plots report the mean ± SE across participants (median across electrodes, average across trials). Scalp topographies report $\Delta r$ for individual channels (only significant channels are shown, significance threshold at $p<0.05$).
  • Figure 3: SI-SDR, ViSQOL, $\text{FAD}_{\text{CLAP}}$ and $\text{FAD}_{\text{VGGish}}$ for models in which both encoder and decoder are trained with noised latents ($D,E$), only the decoder is ($D$), and neither is (the base model, $\emptyset$). Results are shown for the original encoder bottleneck activation of the CAE (TanH) and an alternative (LayerNorm), with fixed latent variance, at two noise levels specified by the logit-normal's mean value $m$ (lower values correspond to more noise).
  • Figure 4: Correlation with IDyOM pitch surprisal for models trained with different latent noise strengths and different bottleneck activation functions.
  • Figure 5: Neural encoding of ICs computed for different models and noise levels. $\Delta r$ denotes the increase in prediction accuracy when comparing a full model (IC + acoustic envelope) with a reduced model including only the envelope. Bar plots report the mean ± SE across participants (median across electrodes, average across trials).
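
The $\Delta r$ quantity in the captions of Figures 2 and 5 can be sketched as a difference of per-channel Pearson correlations between recorded EEG and the predictions of the full and reduced models. The code below is a minimal sketch under assumptions: the function names, array shapes, and synthetic data are hypothetical, and the fitting of the encoding models themselves is not shown.

```python
import numpy as np

def pearson_per_channel(y_true, y_pred):
    """Pearson r between matching columns of two (time, channels) arrays."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return num / den

def delta_r(y_true, pred_full, pred_reduced):
    """Gain in prediction accuracy of the full model over the reduced one."""
    return (pearson_per_channel(y_true, pred_full)
            - pearson_per_channel(y_true, pred_reduced))

# Synthetic stand-ins: 500 time samples, 3 channels.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((500, 3))
pred_full = eeg + 0.1 * rng.standard_normal(eeg.shape)     # e.g. IC + envelope
pred_reduced = eeg + 2.0 * rng.standard_normal(eeg.shape)  # e.g. envelope only
dr = delta_r(eeg, pred_full, pred_reduced)
summary = np.median(dr)  # median across electrodes, as in the captions
```

A positive value for a channel means the added IC feature improved prediction there; the captions then average across trials and report the mean ± SE across participants.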