Listenable Maps for Audio Classifiers

Francesco Paissan, Mirco Ravanelli, Cem Subakan

TL;DR

The paper addresses the difficulty of interpreting audio classifiers by introducing Listenable Maps for Audio Classifiers (L-MAC), a posthoc method that trains a decoder to produce a binary mask over the linear spectrogram. The masked spectrogram is converted back to a waveform via the ISTFT using the phase of the original signal, so each explanation can be listened to directly. The decoder is trained with a masking objective that keeps the explanation faithful to the classifier while encouraging concise masks, and it can optionally be finetuned to improve audio quality. In both in-domain and out-of-domain evaluations, L-MAC achieves better faithfulness metrics and higher user preference than baselines such as gradient-based saliency methods and L2I.

Abstract

Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
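
The training signal described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the use of cross-entropy for both terms, the weights `lam_in`, `lam_out`, and `lam_reg`, and the mean-absolute-value sparsity term are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    z = logits - logits.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return -log_probs[target]

def masking_objective(logits_in, logits_out, target, mask,
                      lam_in=1.0, lam_out=1.0, lam_reg=1e-3):
    """Hypothetical masking loss (assumed form, not taken from the paper).

    logits_in  : classifier output on the masked-in spectrogram, X * M
    logits_out : classifier output on the masked-out spectrogram, X * (1 - M)
    target     : class predicted by the classifier on the unmasked input
    mask       : the mask M, with values in [0, 1]
    """
    keep = cross_entropy(logits_in, target)    # low when masked-in audio keeps the decision
    drop = cross_entropy(logits_out, target)   # high when masked-out audio loses the decision
    sparsity = np.abs(mask).mean()             # assumed regularizer favoring concise masks
    return lam_in * keep - lam_out * drop + lam_reg * sparsity

# Toy comparison: a mask that preserves the evidence versus one that discards it.
mask = np.full((10, 5), 0.5)
good = masking_objective(np.array([5.0, 0.0, 0.0]), np.array([0.0, 2.0, 2.0]), 0, mask)
bad = masking_objective(np.array([0.0, 2.0, 2.0]), np.array([5.0, 0.0, 0.0]), 0, mask)
```

In this toy case `good` is lower than `bad`, because the first mask keeps the classifier confident on the masked-in part and uncertain on the masked-out part. The masked-out term is unbounded below in this form, so a practical version would likely need clipping or a different weighting.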

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: L-MAC Architecture. First, the linear spectrogram $X$ is computed from an audio waveform $x$. Then, the audio features used by the pretrained classifier (e.g., FBANKs) are extracted (Input Tf). The classifier generates class predictions $\hat{y}$, and its latent representations $h$ are fed to the decoder, which produces a binary mask $M$ that selects specific portions of the original linear spectrogram $X$. The listenable interpretation is generated by applying the Inverse Short-Time Fourier Transform (ISTFT) to the masked spectrogram $X \odot M$, with the phase inherited from the original audio waveform. The masking loss used to train the decoder is computed from the classifier predictions on the masked spectrogram and the predicted class $\hat{y}$.
  • Figure 2: The Mean Opinion Scores (MOS) obtained in the user study. (Left) MOS values obtained on recordings from the L2I companion website. (Right) MOS values obtained on newly created random recordings with two sound classes.
  • Figure 3: Example demonstrating the behaviour of L-MAC during the Model Randomization Test (MRT). From left to right: original sample, interpretation, and interpretations generated by randomizing the weights of the convolutional blocks starting from the logits in a cascading fashion, as suggested by Adebayo et al. (adebayo2020sanity). As expected, the interpretations are corrupted by randomizing the weights of the model. From top to bottom: L-MAC, L-MAC finetuned with $\lambda_g=4$ and $\text{CCT}=0.6$, L-MAC finetuned with $\lambda_g=4$ and $\text{CCT}=0.7$, and GradCAM.
  • Figure 4: Sanity checks for saliency maps. (Left) RemOve And Retrain (ROAR) test. The presented results are averages over three runs. The dashed line represents the random attribution baseline. (Right) Structural Similarity Index (SSIM) computed using the Model Randomization Test.
  • Figure 5: Diagram of the decoder neural network $M_\theta(\cdot)$. The representations from the classifier $f(\cdot)$ are fed through different layers of the decoder $M_\theta(\cdot)$ via skip connections.
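
The signal path in Figure 1 (linear spectrogram, mask, ISTFT with the original phase) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the `stft`, `istft`, and `listenable_map` helpers are hypothetical, the Hann window with `n_fft=512` and `hop=128` is an arbitrary choice, and a hand-made frequency mask stands in for the decoder output $M$.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    x_pad = np.pad(x, (n_fft, n_fft))  # zero-pad so every sample is covered by several frames
    n_frames = 1 + (len(x_pad) - n_fft) // hop
    frames = np.stack([x_pad[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, length, n_fft=512, hop=128):
    """Weighted overlap-add inverse of `stft`, trimmed back to `length` samples."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    out = np.zeros(n_fft + hop * (X.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)      # undo the combined analysis/synthesis windowing
    return out[n_fft:n_fft + length]   # drop the padding added in `stft`

def listenable_map(x, mask, n_fft=512, hop=128):
    """Mask the magnitude spectrogram and resynthesize with the original phase."""
    X = stft(x, n_fft, hop)
    magnitude, phase = np.abs(X), np.angle(X)
    masked = (magnitude * mask) * np.exp(1j * phase)  # (|X| * M) with the phase of X
    return istft(masked, len(x), n_fft, hop)

# Toy input: a 200 Hz tone plus a 3000 Hz tone, sampled at 8 kHz (assumed values).
sr = 8000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 200 * t)
x = low_tone + np.sin(2 * np.pi * 3000 * t)

# Stand-in for the decoder output: keep only the bins below 1 kHz.
X = stft(x)
freqs = np.fft.rfftfreq(512, d=1 / sr)
mask = np.broadcast_to(freqs < 1000, X.shape).astype(float)
y = listenable_map(x, mask)  # waveform dominated by the 200 Hz tone
```

Because the mask is real and non-negative, multiplying the magnitude by the mask and reattaching the phase is the same as multiplying the complex spectrogram by the mask. The explicit magnitude and phase split is kept here only to mirror the description in the caption.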