Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing

Martin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega

TL;DR

This work tackles the need for explainable audio segmentation by introducing an explainable-by-design multilabel segmentation model based on non-negative matrix factorization (NMF). A fixed dictionary $\mathbf{W}$ and a learnable non-negative embedding $\mathbf{H}$ are used to reconstruct the spectrogram via $\tilde{\mathbf{X}}=\mathbf{W}\mathbf{H}$ and to predict frame-level labels through $\hat{\mathbf{y}}=\boldsymbol{\theta}\mathbf{H}$, trained with a composite loss that also enforces sparsity: $\mathcal{L}=\alpha\mathcal{L}_{BCE}+\beta\mathcal{L}_{NMF}+\gamma\lVert\mathbf{H}\rVert_1$. The model achieves competitive segmentation performance on SAD, OSD, MD, and ND tasks while enabling probing of $\mathbf{H}$ to reveal informative, fine-grained latent factors beyond segmentation, and analyses demonstrate modularity and compactness in the latent space. Probing shows that $\mathbf{H}$ encodes phoneme-, genre-, gender-, and sound-event-related information, supporting interpretability. The work also provides methods to extract and quantify the relevance, modularity, and compactness of components, offering a framework to evaluate interpretability in latent representations, with code available at https://github.com/Lebourdais/3MAS.
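The composite objective in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not say how $\mathcal{L}_{NMF}$ is defined or how $\boldsymbol{\theta}\mathbf{H}$ is mapped to probabilities, so the sketch assumes a mean squared reconstruction error and a sigmoid before the binary cross-entropy. The function name `composite_loss`, the shapes, and the default weights are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def composite_loss(X, W, H, theta, y, alpha=1.0, beta=1.0, gamma=0.01, eps=1e-7):
    """Sketch of L = alpha * L_BCE + beta * L_NMF + gamma * ||H||_1.

    Assumed shapes (illustrative only):
      X     (F, T)  spectrogram, F frequency bins by T frames
      W     (F, K)  fixed non-negative dictionary with K components
      H     (K, T)  non-negative activations (the embedding)
      theta (C, K)  linear classifier weights for C labels
      y     (C, T)  binary frame-level targets
    """
    X_tilde = W @ H                       # reconstruction: X~ = W H
    logits = theta @ H                    # frame-level scores: y^ = theta H
    # Assumption: a sigmoid turns the scores into per-label probabilities.
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    bce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # Assumption: L_NMF is a mean squared reconstruction error.
    nmf = np.mean((X - X_tilde) ** 2)
    l1 = np.abs(H).sum()                  # sparsity penalty ||H||_1
    total = alpha * bce + beta * nmf + gamma * l1
    return total, {"bce": bce, "nmf": nmf, "l1": l1}


# Toy example with random non-negative factors and four binary labels.
rng = np.random.default_rng(0)
F, K, T, C = 6, 4, 5, 4
W = rng.random((F, K))
H = rng.random((K, T))
theta = rng.normal(size=(C, K))
X = W @ H                                 # exactly reconstructible input
y = rng.integers(0, 2, size=(C, T)).astype(float)

total, parts = composite_loss(X, W, H, theta, y)
print(total, parts)
```

Because the toy input is built as `W @ H`, the reconstruction term vanishes and only the classification and sparsity terms contribute, which makes the role of each weight easy to inspect.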

Abstract

Audio segmentation is a key task for many speech technologies, most of which rely on neural networks that perform well but are usually treated as black boxes. However, in many domains, such as health care and forensics, good performance is not enough: the output decision must also be explained. To be interpretable, explanations derived directly from latent representations need to satisfy "good" properties such as informativeness, compactness, and modularity. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF), which is a good candidate for designing interpretable representations. We show that the model reaches good segmentation performance, and we present in-depth analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives on evaluating interpretable representations against these "good" properties.

Paper Structure (17 sections, 3 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The 3MAS-NMF explainable by-design segmentation model (top) with the different probes (bottom) used to explore the informativeness of the $\mathbf{H}$ embedding. Log spec. means log-spectrogram.
  • Figure 2: Visualization of some components with respect to audio classes: disentangled (left) or complementary (middle, right)