Audio Explanation Synthesis with Generative Foundation Models

Alican Akman, Qiyang Sun, Björn W. Schuller

TL;DR

This work addresses the problem of explaining audio foundation models by moving feature attribution from the input space to the latent embedding space learned by autoencoder-based foundation models. The method computes latent attributions with $Z = \mathrm{Encoder}(X)$ and $\mathrm{att} = \Theta(\mathrm{Classifier}(Z))$, then synthesizes an explanation by decoding a modified latent vector, $X_\Theta = \mathrm{Decoder}(Z_\Theta)$. The approach is validated on keyword spotting and speech emotion recognition, where it yields higher-fidelity explanations that capture meaningful high-level audio components. The result is a latent-space explainability framework that produces interpretable, listenable audio explanations, can support model debugging and justification, and could be extended to more advanced audio generative models.
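
The three steps in the summary above (encode, attribute in latent space, decode a modified latent vector) can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: the encoder, decoder, and classifier are hypothetical random linear maps, the attribution method $\Theta$ is assumed to be gradient-times-input (one established technique; the paper may use others), and ranking by signed attribution and zeroing the unselected latent features are likewise assumptions. The input vector stands in for an audio signal.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM, LATENT_DIM, NUM_CLASSES = 64, 16, 4

# Hypothetical stand-ins: random linear maps instead of a real foundation model.
W_enc = rng.normal(size=(LATENT_DIM, INPUT_DIM)) / np.sqrt(INPUT_DIM)
W_dec = rng.normal(size=(INPUT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
W_clf = rng.normal(size=(NUM_CLASSES, LATENT_DIM))


def encoder(x):
    """Z = Encoder(X)."""
    return W_enc @ x


def decoder(z):
    """X = Decoder(Z)."""
    return W_dec @ z


def classifier(z):
    """Task-specific classifier operating on the latent vector; returns logits."""
    return W_clf @ z


def latent_attribution(z, target):
    """Assumed Theta: gradient x input. For a linear classifier the gradient
    of the target logit with respect to z is simply W_clf[target]."""
    return W_clf[target] * z


def top_mask(att, beta):
    """Boolean mask selecting the top `beta` fraction of latent features
    by signed attribution (assumed ranking rule)."""
    k = max(1, int(round(beta * att.size)))
    mask = np.zeros(att.shape, dtype=bool)
    mask[np.argsort(att)[-k:]] = True
    return mask


def explain(x, beta=0.1):
    """Return (explanation, explanation-removed signal, mask, predicted class)."""
    z = encoder(x)
    target = int(np.argmax(classifier(z)))
    mask = top_mask(latent_attribution(z, target), beta)
    z_keep = np.where(mask, z, 0.0)     # Z_Theta: only the important features
    z_removed = np.where(mask, 0.0, z)  # important features zeroed out
    return decoder(z_keep), decoder(z_removed), mask, target
```

Calling `explain(x, beta=0.1)` on a vector of length 64 returns two decoded signals of the same length: one built only from the highest-attributed latent features, and one with those features removed. With a real model, the first would be the listenable explanation, and the second could be fed back to the classifier to check how much the prediction degrades.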

Abstract

The increasing success of audio foundation models across various tasks has led to a growing need for improved interpretability to understand their intricate decision-making processes better. Existing methods primarily focus on explaining these models by attributing importance to elements within the input space based on their influence on the final decision. In this paper, we introduce a novel audio explanation method that capitalises on the generative capacity of audio foundation models. Our method leverages the intrinsic representational power of the embedding space within these models by integrating established feature attribution techniques to identify significant features in this space. The method then generates listenable audio explanations by prioritising the most important features. Through rigorous benchmarking against standard datasets, including keyword spotting and speech emotion recognition, our model demonstrates its efficacy in producing audio explanations.

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: An overview of our method: The top row depicts the role of a foundation model with autoencoder architecture. The bottom row shows the process of explaining a task-specific classifier model including finding important features in the latent space and generating audio explanations based on these features.
  • Figure 2: Sample spectrogram visualisations for the qualitative audio experiments: (a) Neutral audio of the word ("Rain"), (b) Happy audio of the word ("Rain"), (c) Explanation-removed audio from (b), (d) Explanation audio generated from (b).
  • Figure 3: Confusion matrix for the classifier for TESS dataset after explanation removal by a ratio of $\beta = 0.1$.