Interpreting End-to-End Deep Learning Models for Speech Source Localization Using Layer-wise Relevance Propagation

Luca Comanducci, Fabio Antonacci, Augusto Sarti

TL;DR

This work addresses the interpretability of end-to-end deep learning models for speech source localization by applying Layer-wise Relevance Propagation to two architectures, LocCNN and SampleCNN. The analysis reveals that both networks rely on temporal onset information and effectively denoise/de-reverberate microphone signals to strengthen inter-microphone correlations, improving Time-Difference of Arrival estimation when GCC-PHAT is computed on relevance signals. The study shows that LRP can uncover meaningful processing steps in acoustic DL models, guiding more transparent design and potential improvements in multichannel localization systems. Overall, the paper demonstrates the value of XAI techniques for diagnosing and interpreting end-to-end speech processing networks in complex acoustic environments.

Abstract

Deep learning models are widely applied in the signal processing community, yet their inner workings are often treated as a black box. In this paper, we investigate the application of eXplainable Artificial Intelligence (XAI) techniques to learning-based end-to-end speech source localization models. We consider the Layer-wise Relevance Propagation (LRP) technique, which aims to determine which parts of the input are most important for the output prediction. Using LRP, we analyze two state-of-the-art models of differing architectural complexity that map audio signals acquired by the microphones to the Cartesian coordinates of the source. Specifically, we inspect the relevance associated with the input features of the two models and discover that both networks denoise and de-reverberate the microphone signals to compute more accurate statistical correlations between them and, consequently, localize the sources. To further demonstrate this, we estimate the Time Differences of Arrival (TDoAs) via the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) using both the microphone signals and the relevance signals extracted from the two networks, and show that the latter yield more accurate time-delay estimates.
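
To make the idea of propagating relevance back to the input more concrete, the following is a minimal sketch of one common LRP variant, the epsilon rule, applied to a toy two-layer dense ReLU network. It is an illustration under stated assumptions only: the network, its random weights, the helper name `lrp_epsilon_dense`, and the choice of the epsilon rule are all hypothetical and are not taken from LocCNN, SampleCNN, or the paper's actual LRP configuration.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    """Redistribute the relevance R_out of a dense layer z = a @ W + b onto its inputs a.

    Hypothetical helper implementing the LRP-epsilon rule for a single layer.
    """
    z = a @ W + b                               # pre-activations, shape (n_out,)
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # stabiliser keeps the division well defined
    s = R_out / z                               # relevance per unit of pre-activation
    c = W @ s                                   # send it back along the weights, shape (n_in,)
    return a * c                                # relevance assigned to each input

# Toy regression network (assumption): 16 input samples -> 8 hidden units -> 3 coordinates.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W1, b1 = rng.standard_normal((16, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)), np.zeros(3)

h = np.maximum(x @ W1 + b1, 0.0)   # hidden ReLU activations
y = h @ W2 + b2                    # predicted "coordinates"

# Explain the first output coordinate: start with its value as the total relevance.
R_y = np.zeros(3)
R_y[0] = y[0]
R_h = lrp_epsilon_dense(h, W2, b2, R_y)   # relevance of the hidden units
R_x = lrp_epsilon_dense(x, W1, b1, R_h)   # relevance of each input sample

print(R_x)
```

With zero biases and a small `eps`, the input relevances in `R_x` approximately sum to the explained output `y[0]`, which is the conservation property that lets the relevance be read as a decomposition of the prediction over input samples.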

Paper Structure (12 sections, 2 equations, 4 figures, 1 table)

This paper contains 12 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Results of manipulating relevant input features using the random, amplitude, and LRP strategies.
  • Figure 2: Signal at a microphone placed in $[0.575m,7m,1.2m]^T$ (a), and corresponding relevance using LocCNN (b) and SampleCNN (c) for a source placed in $[1.53m,5.71m,1.15m]^T$.
  • Figure 3: STFT of the signal measured at a microphone placed in $[0.57m, 7m, 1.2m ]^T$ from a source placed in $[1.48m, 5.37m,1.33m]^T$. Top row: microphone signals. Middle row: LocCNN relevance signals. Bottom row: SampleCNN relevance signals. Sub-captions denote the corresponding environmental conditions.
  • Figure 4: GCC-PHATs between two microphones placed in $[1.47m, 7m,1.2m]^T$ and $[1.92m, 7m,1.2m]^T$. Source placed in $[1.48m, 5.43m,1.23m]^T$. Top row: microphone signals. Middle row: LocCNN relevance signals. Bottom row: SampleCNN relevance signals. $\ell$ denotes the time lag in samples. Sub-captions indicate the environmental conditions.
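
The GCC-PHAT curves in Figure 4 are functions of the lag $\ell$ in samples, and the TDoA estimate is the lag at which the curve peaks. A minimal sketch of how such a curve could be computed for one microphone pair is given below. The function names (`gcc_phat`, `estimate_tdoa`), the zero-padding length, the small regularisation constant, and the synthetic white-noise test signal are assumptions made for illustration; they do not reproduce the paper's signals or processing parameters.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag=None):
    """Return (lags, cc): the GCC-PHAT between x1 and x2 over lags in samples.

    Sign convention (assumption): a positive lag means x2 is delayed with respect to x1.
    """
    n = len(x1) + len(x2)                  # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)          # back to the lag domain
    if max_lag is None:
        max_lag = n // 2 - 1
    # Reorder so that lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc

def estimate_tdoa(x1, x2, max_lag=None):
    """Return the lag (in samples) at which the GCC-PHAT peaks."""
    lags, cc = gcc_phat(x1, x2, max_lag)
    return int(lags[np.argmax(cc)])

# Synthetic check (assumption): x2 is x1 delayed by 7 samples, no noise or reverberation.
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
delay = 7
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))

print(estimate_tdoa(x1, x2, max_lag=32))
```

The same two functions can be applied unchanged to a pair of relevance signals instead of the raw microphone signals, which is the comparison the middle and bottom rows of Figure 4 illustrate.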