An Explainable Proxy Model for Multilabel Audio Segmentation

Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega

TL;DR

This work addresses the need for explainable multilabel audio segmentation by introducing a proxy model distilled from a pretrained black-box teacher. The proxy uses non-negative matrix factorization (NMF) to map a learned embedding $\mathbf{H}$ to the frequency domain via a fixed or trainable dictionary $\mathbf{W}$, so that the frequency bins behind a decision can be identified at both the segment and class level. The training objective combines knowledge distillation, spectral reconstruction, and sparsity: $\mathcal{L}=\alpha\mathcal{L}_{KD}+\beta\mathcal{L}_{NMF}+\gamma\|\mathbf{H}\|_1$, where the spectral representation $\mathbf{X}$ of the input is approximated as $\mathbf{X}\approx\mathbf{W}\mathbf{H}$. Experiments on Aragon Radio and DiHard III show that the WavLM-based proxy models perform close to the teacher while providing local (segment-level) and global (class-prototype) explanations that map decisions to spectral components. These frequency-domain explanations have practical implications for auditing and trust in audio analysis systems.
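The combined objective above can be illustrated with a minimal NumPy sketch. The source gives only the weighted sum, so the specific forms of the individual terms here are assumptions: binary cross-entropy against the teacher's soft outputs for $\mathcal{L}_{KD}$, and mean squared reconstruction error for $\mathcal{L}_{NMF}$. The function name `proxy_loss`, the argument names, and the default weights are hypothetical.

```python
import numpy as np

def proxy_loss(student_probs, teacher_probs, X, W, H,
               alpha=1.0, beta=1.0, gamma=0.1, eps=1e-8):
    """Sketch of L = alpha * L_KD + beta * L_NMF + gamma * ||H||_1.

    student_probs, teacher_probs: (T, C) per-frame class probabilities.
    X: (F, T) non-negative spectral representation of the input.
    W: (F, K) non-negative dictionary; H: (K, T) non-negative activations.
    """
    # ASSUMPTION: distillation as binary cross-entropy with the teacher's
    # soft outputs as targets (one independent sigmoid per class).
    p = np.clip(student_probs, eps, 1.0 - eps)
    l_kd = -np.mean(teacher_probs * np.log(p)
                    + (1.0 - teacher_probs) * np.log(1.0 - p))

    # ASSUMPTION: reconstruction term as mean squared error between X and WH.
    l_nmf = np.mean((X - W @ H) ** 2)

    # Sparsity term: entrywise L1 norm of the activations.
    l_sparse = np.abs(H).sum()

    return alpha * l_kd + beta * l_nmf + gamma * l_sparse


# Toy example: 4 frequency bins, 2 components, 3 frames, 4 classes.
W = np.ones((4, 2))
H = np.ones((2, 3))
X = W @ H                      # perfect reconstruction, so L_NMF = 0
probs = np.full((3, 4), 0.5)   # student and teacher agree at 0.5
loss = proxy_loss(probs, probs, X, W, H, alpha=0.0, beta=1.0, gamma=0.5)
```

In this toy case the reconstruction error is zero, so with `alpha=0` the loss reduces to `gamma` times the sum of the six unit entries of `H`. Raising `gamma` pushes more activations toward zero, which is what makes the surviving components easier to read as explanations.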

Abstract

Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is vital for making machine-learning decisions transparent. In this paper, we propose an explainable multilabel segmentation model that simultaneously performs speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD). This proxy model uses non-negative matrix factorization (NMF) to map the embedding used for segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to that of the pretrained black-box model, together with strong explainability features. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).

Paper Structure (15 sections, 5 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Diagram of the proposed architecture for NMF-based multilabel segmentation explanation. Branches A and B represent the spectrogram-based and WavLM-based proxy models, respectively.
  • Figure 2: (Top) Segment-level (local) relevant frequency bins for speech and music segmentation as a function of the relevance threshold $\tau$. (Bottom) Classification scores for (--) speech, (- -) music, and ($\boldsymbol{\cdots}$) overlapped speech with each set of selected components. The two audio samples are from the Aragon Radio evaluation set: speech only (left) and music only (right).
  • Figure 3: Global relevant components for speech (sp), music (mu), and overlapped speech (ov).
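The captions above refer to selecting relevant components with a threshold $\tau$ and reading off the corresponding frequency bins. The sketch below shows one plausible way to do this with an NMF dictionary. It is not the paper's procedure: the per-component `relevance` scores are taken as given, the rule "keep components whose relevance is at least $\tau$ times the maximum" is an assumption, and the function name `relevant_bins` is hypothetical.

```python
import numpy as np

def relevant_bins(W, H, relevance, tau):
    """Select NMF components by relevance and project them to frequency.

    W: (F, K) dictionary; H: (K, T) activations for one segment.
    relevance: length-K non-negative score per component (assumed given).
    tau: threshold in [0, 1], applied relative to the largest score.
    """
    relevance = np.asarray(relevance, dtype=float)

    # ASSUMPTION: a component is relevant if its score reaches
    # tau times the maximum score.
    selected = np.flatnonzero(relevance >= tau * relevance.max())

    # Rebuild the spectrum from the selected components only, then
    # average over time to get one value per frequency bin.
    X_rel = W[:, selected] @ H[selected, :]
    profile = X_rel.mean(axis=1)
    return selected, profile


# Toy example: 3 frequency bins, 3 components (one per bin), 2 frames.
W = np.eye(3)
H = np.ones((3, 2))
selected, profile = relevant_bins(W, H, relevance=[0.1, 0.9, 0.5], tau=0.5)
```

Here the threshold keeps components 1 and 2, and because each toy dictionary atom covers a single bin, the profile is non-zero only in those two bins. A lower `tau` admits more components and broadens the set of highlighted frequencies.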