Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Pablo Alonso-Jiménez, Leonardo Pepino, Roser Batlle-Roca, Pablo Zinemanas, Dmitry Bogdanov, Xavier Serra, Martín Rocamora

TL;DR

PECMAE addresses interpretability in music audio classification by decoupling autoencoder training from prototype learning and leveraging a diffusion-based decoder for prototype sonification. Building on EnCodecMAE features, it uses a transformer-based encoder to produce a compact latent vector $z \in \mathbb{R}^{768}$ and a latent diffusion model to reconstruct associated audio, while a prototypical network learns class-specific prototypes with a loss $L = \lambda L_c + (1-\lambda) L_p$ where $L_p$ enforces proximity to in-class samples ($L_p = \frac{1}{M} \sum_{j=1}^{M} \min_i \|z_{xc,ij} - z_{p,ij}\|^2_2$). The method achieves competitive accuracy on instrument recognition (Medley-Solos-DB) and genre classification (GTZAN, XAI-Genre), with improvements over APNet, and provides interpretable insights through sonic prototyping. Decoder choice (latent diffusion vs. token-based) impacts reconstruction fidelity but preserves class information; sonification reveals predominant sonic textures guiding the classifier, offering practical interpretability for researchers and developers. Overall, PECMAE demonstrates that prototype-based interpretability can be scaled via self-supervised embeddings and generative decoding, with clear avenues for extending representations and sequence length.
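The combined objective above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the default `lam=0.5`, and the assumption that classification logits are supplied by the caller are all hypothetical. It also assumes one reading of the prototype term, namely that each of the $M$ prototypes is pulled toward its nearest same-class embedding in the batch, and that $L_c$ is a standard cross-entropy.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_loss(embeddings, labels, prototypes, proto_labels):
    """L_p: mean over prototypes of the squared distance to the nearest
    same-class embedding (assumed reading of the min over i).

    Assumes every prototype's class has at least one sample in the batch.
    """
    nearest = []
    for proto, proto_class in zip(prototypes, proto_labels):
        dists = [
            squared_l2(z, proto)
            for z, y in zip(embeddings, labels)
            if y == proto_class
        ]
        nearest.append(min(dists))
    return sum(nearest) / len(nearest)


def cross_entropy(logits, labels):
    """L_c: mean cross-entropy, computed with a stable log-sum-exp."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[y]
    return total / len(labels)


def total_loss(logits, embeddings, labels, prototypes, proto_labels, lam=0.5):
    """L = lam * L_c + (1 - lam) * L_p; lam=0.5 is an arbitrary default."""
    l_c = cross_entropy(logits, labels)
    l_p = prototype_loss(embeddings, labels, prototypes, proto_labels)
    return lam * l_c + (1 - lam) * l_p
```

With `lam` close to 1 the objective is dominated by classification accuracy, while smaller values keep prototypes closer to real in-class embeddings, which is presumably what makes them meaningful to decode into audio.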

Abstract

We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model builds on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. We instead propose to decouple the two training processes. This enables us to leverage existing self-supervised autoencoders pre-trained on much larger data (EnCodecMAE), providing representations with better generalization. APNet reconstructs prototypes to waveforms for interpretability, but this relies on the nearest training data samples. In contrast, we explore using a diffusion decoder that allows reconstruction without such dependency. We evaluate our method on datasets for music instrument classification (Medley-Solos-DB) and genre recognition (GTZAN and a larger in-house dataset), the latter being a more challenging task not addressed with prototypical networks before. We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings, while the sonification of prototypes helps in understanding the behavior of the classifier.

Paper Structure (14 sections, 1 equation, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Diagram of the proposed PECMAE model. The colored boxes indicate trainable modules.