Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing

Martin Lebourdais; Théo Mariotte; Antonio Almudévar; Marie Tahon; Alfonso Ortega

TL;DR

This work addresses the need for explainable audio segmentation by introducing an explainable-by-design multilabel segmentation model based on non-negative matrix factorization (NMF). A fixed dictionary $\mathbf{W}$ and a learnable non-negative embedding $\mathbf{H}$ reconstruct the spectrogram as $\tilde{\mathbf{X}}=\mathbf{W}\mathbf{H}$ and predict frame-level labels as $\hat{\mathbf{y}}=\boldsymbol{\theta}\mathbf{H}$. The model is trained with a composite loss that combines a binary cross-entropy term, an NMF reconstruction term, and an $\ell_1$ sparsity penalty on $\mathbf{H}$: $\mathcal{L}=\alpha\mathcal{L}_{BCE}+\beta\mathcal{L}_{NMF}+\gamma\lVert\mathbf{H}\rVert_1$. It achieves competitive performance on four segmentation tasks: speech activity detection (SAD), overlapped speech detection (OSD), music detection (MD), and noise detection (ND). Probing $\mathbf{H}$ shows that it encodes phoneme-, genre-, gender-, and sound-event-related information, that is, fine-grained latent factors beyond those required for segmentation, and further analyses indicate modularity and compactness in the latent space. The work also provides methods to extract relevant components and to quantify their relevance, modularity, and compactness, offering a framework for evaluating the interpretability of latent representations. Code is available at https://github.com/Lebourdais/3MAS.
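To make the composite objective concrete, the following is a minimal NumPy sketch of the forward pass and loss described above. It is an illustration under stated assumptions, not the authors' implementation: the function name `composite_loss`, the matrix shapes, the sigmoid applied to $\boldsymbol{\theta}\mathbf{H}$, the use of mean squared error for $\mathcal{L}_{NMF}$, and the default weights are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def composite_loss(X, W, H, theta, Y, alpha=1.0, beta=1.0, gamma=0.01, eps=1e-7):
    """Sketch of L = alpha*L_BCE + beta*L_NMF + gamma*||H||_1.

    Assumed shapes (not from the paper):
      X: (F, T) spectrogram, W: (F, K) fixed dictionary,
      H: (K, T) non-negative activations, theta: (C, K) classifier weights,
      Y: (C, T) binary frame-level labels.
    """
    X_tilde = W @ H                      # reconstruction, X~ = W H
    logits = theta @ H                   # frame-level scores, theta H
    # Assumption: a sigmoid maps scores to per-class probabilities.
    probs = np.clip(sigmoid(logits), eps, 1.0 - eps)
    l_bce = -np.mean(Y * np.log(probs) + (1.0 - Y) * np.log(1.0 - probs))
    # Assumption: squared error stands in for the reconstruction term.
    l_nmf = np.mean((X - X_tilde) ** 2)
    l_sparse = np.abs(H).sum()           # l1 norm of H
    total = alpha * l_bce + beta * l_nmf + gamma * l_sparse
    return total, (l_bce, l_nmf, l_sparse)

# Synthetic example: X is built exactly from W and H, so the reconstruction
# term vanishes and only the BCE and sparsity terms contribute.
rng = np.random.default_rng(0)
F, K, T, C = 64, 16, 100, 4
W = rng.random((F, K))
H = np.maximum(rng.normal(size=(K, T)), 0.0)   # ReLU keeps H non-negative
X = W @ H
theta = rng.normal(size=(C, K))
Y = (rng.random((C, T)) > 0.5).astype(float)
total, (l_bce, l_nmf, l_sparse) = composite_loss(X, W, H, theta, Y)
```

In this sketch the same $\mathbf{H}$ feeds both the reconstruction and the classifier, which is what ties the predicted labels to components that can be inspected in the spectral domain.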

Abstract

Audio segmentation is a key task for many speech technologies. Most of these technologies rely on neural networks, which achieve high performance but are usually regarded as black boxes. However, in many domains, such as health or forensics, good performance is not enough: explanations of the output decision are also required. To be interpretable, explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF), which is a good candidate for the design of interpretable representations. This paper shows that our model reaches good segmentation performance, and presents in-depth analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.

Paper Structure (17 sections, 3 equations, 2 figures, 3 tables)

Introduction
Related works
Multilabel NMF segmentation model
Problem formulation
Training procedure
Implementation details
Segmentation evaluation
Experimental protocol
Segmentation results
Probing H activations
Experimental protocol
Classification results
Analysis of relevant components
Relevant component extraction for explanation
Compactness and modularity
...and 2 more sections

Figures (2)

Figure 1: The 3MAS-NMF explainable by-design segmentation model (top) with the different probes (bottom) used to explore the informativeness of the $\mathbf{H}$ embedding. Log spec. means log-spectrogram.
Figure 2: Visualization of some components with respect to audio classes: disentangled (left) or complementary (middle, right)
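As an illustration of what probing the $\mathbf{H}$ embedding can look like in practice, the sketch below fits a logistic-regression probe on frozen activations and inspects which component carries the most weight. It is a generic probing example on synthetic data, not the probes used in the paper: the helper names `fit_probe` and `probe_accuracy`, the binary attribute, and the hyperparameters are assumptions made for the example.

```python
import numpy as np

def fit_probe(H, labels, lr=1.0, steps=2000):
    """Fit a logistic-regression probe on frozen activations.

    H: (K, T) non-negative activations (kept fixed, never updated here).
    labels: (T,) binary attribute to probe for (hypothetical, e.g. a gender tag).
    """
    mu = H.mean(axis=1, keepdims=True)   # center each component
    Hc = H - mu
    K, T = Hc.shape
    w, b = np.zeros(K), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ Hc + b)))
        grad = p - labels                # gradient of BCE w.r.t. the logits
        w -= lr * (Hc @ grad) / T
        b -= lr * grad.mean()
    return w, b, mu

def probe_accuracy(H, labels, w, b, mu):
    pred = (w @ (H - mu) + b) > 0
    return float((pred == (labels > 0.5)).mean())

# Synthetic activations in which component 2 alone determines the attribute.
rng = np.random.default_rng(0)
K, T = 8, 400
H = rng.random((K, T))
labels = (H[2] > 0.5).astype(float)

w, b, mu = fit_probe(H, labels)
acc = probe_accuracy(H, labels, w, b, mu)
top = int(np.argmax(np.abs(w)))          # component with the largest probe weight
```

A high probe accuracy suggests the attribute is linearly recoverable from $\mathbf{H}$, and the magnitude of the probe weights gives a first, rough indication of which components are relevant to it.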
