
Listenable Maps for Zero-Shot Audio Classifiers

Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

TL;DR

This work introduces LMAC-ZS, a decoder-based post-hoc interpreter for zero-shot audio classifiers built on the CLAP cross-modal model. The method learns masks that preserve text-audio similarities under masking, enabling faithful explanations in Mel, STFT, or raw audio domains. Through extensive quantitative and qualitative evaluations, LMAC-ZS consistently demonstrates superior faithfulness compared to baselines like GradCAM++, and can generate prompt-driven, listenable explanations that align with model decisions. The approach advances transparent zero-shot audio classification with potential practical impact in settings such as healthcare, while acknowledging limitations and areas for future work.

Abstract

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
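As a point of reference for the zero-shot setting mentioned in the abstract, the sketch below shows how a text-and-audio similarity can be turned into a class decision: the audio embedding is compared with one embedding per text prompt, and the most similar prompt wins. This is only an illustration under stated assumptions. The 3-dimensional embeddings and the prompt strings are hand-picked placeholders, not outputs of CLAP, and the helper names `l2_normalize` and `zero_shot_classify` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_classify(audio_emb, text_embs):
    """Return (predicted prompt index, similarity per prompt) for one audio embedding."""
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(text_embs, dtype=float))
    sims = t @ a  # shape: (num_prompts,)
    return int(np.argmax(sims)), sims

# Hypothetical embeddings chosen by hand; a real system would obtain them
# from the text and audio encoders of a cross-modal model.
prompts = ["this is the sound of a dog", "this is the sound of rain"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
audio_emb = np.array([0.9, 0.1, 0.0])

pred, sims = zero_shot_classify(audio_emb, text_embs)
print(prompts[pred], sims)
```

An interpreter that is faithful in this setting should produce explanations whose similarities to the prompts lead to the same decision as the unmasked input.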

Paper Structure (17 sections, 13 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: (left) The training of the CLAP model for learning cross-modal representations. (right) Zero-shot classification with the CLAP model.
  • Figure 2: LMAC-ZS architecture. The input linear-frequency spectrogram $X_i$ (the $i$-th audio in the batch) first passes through the transformations (InputTf block) that make it compatible with the input domain (e.g., Mel spectra) of the audio encoder $f_\text{audio}(\cdot)$, which yields the latent representation $h_i$. This representation, along with the text representation $t_j$ (the $j$-th text prompt in the batch), is then fed to the decoder $M_\theta(\cdot, \cdot)$. The resulting mask is multiplied element-wise with the input spectrogram $X_i$. The masked spectrogram $M \odot X_i$ is then converted back to the input domain of the audio encoder, and the similarity score $t_j^\top f_\text{audio}(M_\theta(t_j, h_i) \odot X_i)$ is calculated, which is used in the overall training objective $\mathcal{L}_{ZS}(\theta)$.
  • Figure 3: (left) Mask-Mean vs. Similarity for LMAC-ZS, (middle) Mask-Mean vs. Similarity for GradCAM++, (right) Model Randomization Test for LMAC-ZS and GradCAM++.
  • Figure 4: Qualitative comparison of the explanations given by LMAC-ZS and GradCAM++ for two different classes. LMAC-ZS shuts off the explanation depending on the similarity of the given prompt to the input audio, whereas GradCAM++ remains insensitive to the class label.
  • Figure 5: Visualization of interpretations after cascading model randomization. The left column is the input, the second column is the original interpretation, and each further column to the right has more layers randomized. The top row is for LMAC-ZS, and the bottom row is for GradCAM++.
  • ...and 2 more figures
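
The data flow described in the Figure 2 caption (encode the audio, decode a mask from the text and audio representations, mask the spectrogram, re-encode, compare similarities) can be sketched as follows. This is a toy illustration under loud assumptions: the encoder and decoder are random linear maps standing in for InputTf, $f_\text{audio}$, and $M_\theta$; the sizes `F`, `T`, `D` are arbitrary; the mask is constant over time for simplicity; and the final `gap` is only an illustrative discrepancy, not the paper's objective $\mathcal{L}_{ZS}(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D = 8, 10, 4  # frequency bins, time frames, embedding size (toy values)

W_audio = rng.normal(size=(D, F))    # assumption: stand-in for InputTf + f_audio
W_dec = rng.normal(size=(F, 2 * D))  # assumption: stand-in for decoder parameters theta

def encode_audio(X):
    """Toy encoder: average over time, then project to a D-dim representation h."""
    return W_audio @ X.mean(axis=1)

def decode_mask(t, h, n_frames):
    """Toy decoder M_theta(t, h): one sigmoid gate per frequency bin, tiled over time."""
    logits = W_dec @ np.concatenate([t, h])
    gate = 1.0 / (1.0 + np.exp(-logits))
    return np.repeat(gate[:, None], n_frames, axis=1)

X = np.abs(rng.normal(size=(F, T)))  # non-negative toy "spectrogram" X_i
t = rng.normal(size=D)               # toy text representation t_j

h = encode_audio(X)                           # h_i
M = decode_mask(t, h, T)                      # mask with the same shape as X
sim_original = float(t @ h)                   # similarity for the unmasked input
sim_masked = float(t @ encode_audio(M * X))   # similarity for the masked input
gap = abs(sim_original - sim_masked)          # illustrative discrepancy only
print(sim_original, sim_masked, gap)
```

In a trained interpreter the decoder parameters would be optimized so that the masked input preserves the original text-audio similarities; here they are random, so the printed gap carries no meaning beyond showing where such a comparison would sit in the pipeline.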