An Explainable Proxy Model for Multilabel Audio Segmentation
Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega
TL;DR
This work tackles the need for explainable multilabel audio segmentation by introducing an explainable proxy model trained from a pretrained black-box teacher. The proxy uses non-negative matrix factorization (NMF) to map a learned embedding $\mathbf{H}$ to the frequency domain via a fixed or trainable dictionary $\mathbf{W}$, enabling segment- and class-level explanations through identified frequency bins. The training objective combines knowledge distillation, spectral reconstruction, and sparsity: $\mathcal{L}=\alpha\mathcal{L}_{KD}+\beta\mathcal{L}_{NMF}+\gamma\|\mathbf{H}\|_1$, with $\mathbf{X}\approx\mathbf{W}\mathbf{H}$. Experiments on Aragon Radio and DiHard III show that the WavLM-based proxy models achieve performance close to the teacher while offering strong explainability features, including local (segment-level) and global (prototype) explanations that map decisions to spectral components. This approach provides transparent, frequency-domain insights into multilabel segmentation with practical implications for auditing and trust in audio analysis systems.
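As a rough illustration of the objective above, the following NumPy sketch evaluates $\mathcal{L}=\alpha\mathcal{L}_{KD}+\beta\mathcal{L}_{NMF}+\gamma\|\mathbf{H}\|_1$ for given arrays. Several choices here are assumptions rather than details stated in the text: the distillation term is taken to be a binary cross-entropy between student and teacher class posteriors, the reconstruction term a mean squared error between $\mathbf{X}$ and $\mathbf{W}\mathbf{H}$, and the sparsity term the entrywise sum of $|\mathbf{H}|$. The function name `proxy_loss` and the default weights are hypothetical.

```python
import numpy as np

def proxy_loss(X, W, H, student_probs, teacher_probs,
               alpha=1.0, beta=1.0, gamma=0.01, eps=1e-8):
    """Combined loss: alpha * L_KD + beta * L_NMF + gamma * ||H||_1.

    X: (F, T) non-negative spectrogram, W: (F, K) dictionary,
    H: (K, T) non-negative embedding (activations),
    student_probs, teacher_probs: (C, T) per-class posteriors in [0, 1].
    """
    # Assumed distillation term: binary cross-entropy with teacher posteriors
    # as soft targets (one independent binary decision per class).
    s = np.clip(student_probs, eps, 1.0 - eps)
    l_kd = -np.mean(teacher_probs * np.log(s)
                    + (1.0 - teacher_probs) * np.log(1.0 - s))

    # Assumed reconstruction term: mean squared error between X and W @ H.
    l_nmf = np.mean((X - W @ H) ** 2)

    # Sparsity term: entrywise L1 norm of the activations.
    l_sparse = np.abs(H).sum()

    return alpha * l_kd + beta * l_nmf + gamma * l_sparse


# Toy usage with random non-negative data (shapes are arbitrary).
rng = np.random.default_rng(0)
W = rng.random((16, 4))          # 16 frequency bins, 4 components
H = rng.random((4, 10))          # 10 frames
X = W @ H                        # perfectly reconstructable spectrogram
teacher = rng.random((4, 10))    # 4 classes (e.g. SAD, MD, ND, OSD)
student = rng.random((4, 10))
loss = proxy_loss(X, W, H, student, teacher)
```

Setting $\gamma$ larger pushes more entries of $\mathbf{H}$ toward zero, which tends to make each frame depend on fewer dictionary components; the right balance between the three weights is a tuning question this sketch does not address.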
Abstract
Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainability is vital for making machine-learning decisions transparent. In this paper, we propose an explainable multilabel segmentation model that simultaneously solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD). The model acts as a proxy for a pre-trained black-box model and uses non-negative matrix factorization (NMF) to map the embedding used for segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to that of the pre-trained black-box model, together with strong explainability features. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).
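To make the idea of a segment-level explanation concrete, the sketch below shows one simple way an NMF factorization could point to frequency bins: average the activations over the frames of a segment, project them through the dictionary, and rank the bins by the resulting energy. This is an illustrative assumption, not the procedure described in the paper; the function name `top_frequency_bins` and the ranking rule are hypothetical.

```python
import numpy as np

def top_frequency_bins(W, H, frames, k=5):
    """Rank frequency bins by reconstructed energy over a segment.

    W: (F, K) non-negative dictionary, H: (K, T) non-negative activations,
    frames: list of frame indices belonging to the segment,
    k: number of bins to return.
    Returns (indices of the k strongest bins, full (F,) frequency profile).
    """
    # Mean activation of each dictionary component over the segment.
    activation = H[:, frames].mean(axis=1)      # shape (K,)
    # Project back to the frequency domain through the dictionary.
    profile = W @ activation                    # shape (F,)
    # Indices of the k largest entries, strongest first.
    top = np.argsort(profile)[::-1][:k]
    return top, profile


# Toy usage: component 0 lives in bin 0, component 1 in bin 2.
W = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
H = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
bins_first_segment, _ = top_frequency_bins(W, H, frames=[0, 1], k=1)
bins_last_segment, _ = top_frequency_bins(W, H, frames=[2], k=1)
```

A class-level prototype could be approximated in the same spirit by averaging over all frames assigned to a class instead of a single segment, again as an assumption rather than the authors' definition.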
