Focal Modulation Networks for Interpretable Sound Classification

Luca Della Libera, Cem Subakan, Mirco Ravanelli

TL;DR

This work pursues interpretability by design in the audio domain using FocalNets, which replace self-attention with focal modulation to produce context-aware representations. The authors apply FocalNets to environmental sound classification on ESC-50 and report competitive accuracy together with modulation maps that highlight the informative time-frequency regions of the input. The FocalNet outperforms a similarly sized ViT baseline in both accuracy and interpretability, and its interpretations are competitive with PIQ, a post-hoc method designed for audio, without requiring additional training or processing. The results suggest that interpretable-by-design architectures can rival post-hoc methods in audio, with practical benefits for trust and transparency; possible extensions include other datasets and user-centered assessment of the listenable interpretations.

Abstract

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
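
For orientation, the focal modulation operation that replaces self-attention can be summarized as below. The notation follows the original FocalNets paper (Yang et al., 2022) rather than anything stated on this page, so it should be read as a sketch of the general mechanism and not as this paper's exact configuration:

$$
y_i = q(x_i) \odot h\!\left(\sum_{l=1}^{L+1} g_i^{l} \cdot z_i^{l}\right)
$$

Here $x_i$ is the input feature at position $i$, $q(\cdot)$ and $h(\cdot)$ are linear projections, $z_i^{l}$ is the context feature at focal level $l$ (levels $1$ to $L$ come from stacked depthwise convolutions with growing receptive fields, and level $L+1$ is a global average-pooled summary), and $g_i^{l}$ is a learned gate that weights each level. The term inside $h(\cdot)$ is the modulator; its magnitude over positions is presumably what the modulation maps visualize. The projection $q(\cdot)$ here is unrelated to the quantile order $q$ mentioned in Figure 3.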

Paper Structure (15 sections, 7 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: The focal modulation layer (Yang et al., 2022).
  • Figure 2: From left to right, the first source (church_bells), the second source (water_drops), the mixture (church_bells + water_drops), and the interpretation corresponding to the FocalNet's prediction (church_bells). The x-axis represents time, the y-axis represents frequency.
  • Figure 3: The effect of quantile order $q$ on interpretability.
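
The page does not spell out how the quantile order $q$ enters the interpretation, so the following is only a minimal sketch under one plausible assumption: the modulation map is binarized at its $q$-th quantile and the resulting mask is applied to the spectrogram, so that a larger $q$ keeps fewer time-frequency bins. The function names (`quantile_mask`, `interpret`), the array shapes, and the random inputs are all hypothetical, and the map is assumed to already have the same resolution as the spectrogram.

```python
import numpy as np


def quantile_mask(modulation_map: np.ndarray, q: float) -> np.ndarray:
    """Binary mask keeping entries at or above the q-th quantile of the map.

    Assumption: this thresholding rule is illustrative, not taken from the paper.
    """
    threshold = np.quantile(modulation_map, q)
    return (modulation_map >= threshold).astype(modulation_map.dtype)


def interpret(spectrogram: np.ndarray, modulation_map: np.ndarray, q: float = 0.9) -> np.ndarray:
    """Mask the spectrogram so that only the most strongly modulated bins remain."""
    return spectrogram * quantile_mask(modulation_map, q)


# Synthetic stand-ins for a (frequency bins, time frames) spectrogram and a
# same-sized modulation map; real inputs would come from the trained model.
rng = np.random.default_rng(0)
spec = rng.random((64, 100))
mod = rng.random((64, 100))

# Fraction of time-frequency bins kept for several quantile orders.
kept = {q: float(quantile_mask(mod, q).mean()) for q in (0.5, 0.9, 0.99)}
masked = interpret(spec, mod, q=0.9)
```

Under this assumption, $q$ trades off sparsity against coverage: a low $q$ keeps most of the input and says little about what mattered, while a very high $q$ may discard parts of the sound that the classifier relied on.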