Focal Modulation Networks for Interpretable Sound Classification
Luca Della Libera, Cem Subakan, Mirco Ravanelli
TL;DR
This work tackles interpretability by design in the audio domain using FocalNets, attention-free networks that replace self-attention with focal modulation to produce context-aware representations. The authors apply FocalNets to environmental sound classification on ESC-50 and report higher accuracy and better interpretability than a similarly sized vision transformer (ViT), with modulation maps that highlight informative spectro-temporal regions. The approach is also competitive with PIQ, a post-hoc interpretation method designed for audio, without requiring additional training or processing. The results suggest that interpretable-by-design architectures can be competitive with post-hoc methods in audio, with practical benefits for trust and transparency in audio systems and possible extensions to other datasets and to user-centered assessments of the listenable interpretations.
Abstract
The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
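The abstract names focal modulation without stating the operation. As a brief reminder, the sketch below follows the notation of the original FocalNets formulation rather than anything given in this summary, so the symbols are assumptions: x_i is the input token at position i, q, h, and f_z are linear projections, g_i^l is a learned gate, and z^l is the context feature at focal level l.

```latex
% Focal modulation for token i: the query projection is multiplied
% element-wise by a modulator built from gated multi-scale context.
y_i = q(x_i) \odot h\!\left( \sum_{l=1}^{L+1} g_i^{l} \, z_i^{l} \right)

% Hierarchical contextualization: stacked depth-wise convolutions
% widen the receptive field level by level, starting from z^0 = f_z(X).
z^{l} = \mathrm{GeLU}\!\left( \mathrm{DWConv}\!\left( z^{l-1} \right) \right),
\qquad l = 1, \dots, L

% Global context: the last level is average-pooled over all positions.
z^{L+1} = \mathrm{AvgPool}\!\left( z^{L} \right)
```

Under this reading, a modulation map would be the magnitude of the modulator (the argument of the element-wise product) at each time-frequency position of the spectrogram. It is available from a single forward pass, which is consistent with the claim that no additional training or processing is needed.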
