Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Pablo Alonso-Jiménez; Leonardo Pepino; Roser Batlle-Roca; Pablo Zinemanas; Dmitry Bogdanov; Xavier Serra; Martín Rocamora

TL;DR

PECMAE addresses interpretability in music audio classification by decoupling autoencoder training from prototype learning and using a diffusion-based decoder to sonify the prototypes. Building on EnCodecMAE features, it uses a transformer-based encoder to produce a compact latent vector $z \in \mathbb{R}^{768}$ and a latent diffusion model to reconstruct the associated audio. A prototypical network learns class-specific prototypes with the loss $L = \lambda L_c + (1-\lambda) L_p$, where $L_p = \frac{1}{M} \sum_{j=1}^{M} \min_i \|z_{x_c,i} - z_{p,j}\|_2^2$ enforces proximity of the prototypes to in-class samples. The method achieves competitive accuracy on instrument recognition (Medley-Solos-DB) and genre classification (GTZAN, XAI-Genre), with improvements over APNet, and provides interpretable insights through prototype sonification. The choice of decoder (latent diffusion vs. token-based) affects reconstruction fidelity but preserves class information. Sonification reveals the predominant sonic textures guiding the classifier, which gives researchers and developers a practical way to inspect its behavior. Overall, PECMAE demonstrates that prototype-based interpretability can be scaled via self-supervised embeddings and generative decoding, with clear avenues for extending the representations and the sequence length.
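The combined loss above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that $i$ indexes the in-class latent vectors, that $j$ indexes the $M$ prototypes of that class, and that $L_c$ is a standard cross-entropy over class logits. The function names (`prototype_loss`, `cross_entropy`, `total_loss`) and the default `lam=0.5` are hypothetical choices for illustration.

```python
import numpy as np


def prototype_loss(z_class, z_protos):
    """Mean over prototypes of the squared L2 distance to the nearest in-class sample.

    z_class:  (N, D) latent vectors of samples from one class (assumed layout).
    z_protos: (M, D) prototypes of the same class (assumed layout).
    """
    # Pairwise differences via broadcasting: (N, 1, D) - (1, M, D) -> (N, M, D).
    diffs = z_class[:, None, :] - z_protos[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)  # (N, M) squared distances
    # min over samples i for each prototype j, then average over the M prototypes.
    return float(np.mean(np.min(sq_dists, axis=0)))


def cross_entropy(logits, labels):
    """Assumed form of L_c: mean categorical cross-entropy from raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


def total_loss(logits, labels, z_class, z_protos, lam=0.5):
    """L = lam * L_c + (1 - lam) * L_p; lam=0.5 is an illustrative default."""
    l_c = cross_entropy(logits, labels)
    l_p = prototype_loss(z_class, z_protos)
    return lam * l_c + (1.0 - lam) * l_p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_class = rng.normal(size=(8, 768))   # 8 in-class latents, D = 768
    z_protos = rng.normal(size=(3, 768))  # 3 prototypes for that class
    logits = rng.normal(size=(8, 4))      # toy logits for 4 classes
    labels = rng.integers(0, 4, size=8)
    print(total_loss(logits, labels, z_class, z_protos))
```

For simplicity the sketch computes $L_p$ for a single class; a multi-class version would presumably repeat the same computation per class and combine the results.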

Abstract

We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model builds on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. Instead, we propose to decouple the two training processes. This enables us to leverage existing self-supervised autoencoders pre-trained on much larger data (EnCodecMAE), providing representations with better generalization. APNet reconstructs prototypes to waveforms for interpretability, but this reconstruction relies on the nearest training data samples. In contrast, we explore using a diffusion decoder that allows reconstruction without such a dependency. We evaluate our method on datasets for music instrument classification (Medley-Solos-DB) and genre recognition (GTZAN and a larger in-house dataset), the latter being a more challenging task not addressed with prototypical networks before. We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings, while the sonification of prototypes helps in understanding the behavior of the classifier.

Paper Structure (14 sections, 1 equation, 1 figure, 2 tables)

Introduction
Related work
Audio Prototype Network
EnCodecMAE
Method
Generative autoencoder
Prototypical network
Experiments and results
Datasets
Implementation details
Classification Results
Effect of the decoder
Sonifying the prototypes
Conclusions and future work

Figures (1)

Figure 1: Diagram of the proposed PECMAE model. The colored boxes indicate trainable modules.
