From Vision to Sound: Advancing Audio Anomaly Detection with Vision-Based Algorithms
Manuel Barusco, Francesco Borsatti, Davide Dalle Pezze, Francesco Paissan, Elisabetta Farella, Gian Antonio Susto
TL;DR
This paper tackles audio anomaly detection (AAD) by transferring Visual Anomaly Detection (VAD) paradigms to the audio domain, with a focus on explainable, fine-grained localization in spectrograms. In the proposed framework, a pre-trained audio encoder (e.g., CNN14 from CLAP) embeds the spectrogram, and a VAD method turns those embeddings into an anomaly heatmap over the time-frequency plane, all in an unsupervised setting. The authors introduce new evaluation metrics for spectrogram localization and assess four VAD approaches (PatchCore, PaDiM, CFA, STFPM) on industrial (MIMII) and environmental (EnvMix) benchmarks across varying SNRs. The results show that anomaly maps improve explainability and indicate cross-domain potential, while highlighting the need for better faithfulness metrics and the impact of the feature extractor on performance.
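To make the pipeline concrete, here is a minimal sketch of a PatchCore-style nearest-neighbor anomaly map over a spectrogram. It is not the authors' implementation: flattened raw patches stand in for the pre-trained encoder embeddings (an assumption made so the example runs with NumPy alone), the memory bank is kept whole with no coreset subsampling, and the function names `patch_embeddings`, `fit_memory_bank`, and `anomaly_map` are hypothetical.

```python
import numpy as np

def patch_embeddings(spec, patch=8):
    """Split a (freq, time) spectrogram into non-overlapping patches and
    flatten each one. Stand-in for a pre-trained audio encoder (assumption)."""
    n_f, n_t = spec.shape[0] // patch, spec.shape[1] // patch
    spec = spec[:n_f * patch, :n_t * patch]
    # (n_f, patch, n_t, patch) -> (n_f, n_t, patch, patch)
    patches = spec.reshape(n_f, patch, n_t, patch).transpose(0, 2, 1, 3)
    return patches.reshape(n_f, n_t, patch * patch)

def fit_memory_bank(normal_specs, patch=8):
    """Collect patch embeddings from normal-only data (unsupervised setting)."""
    feats = [patch_embeddings(s, patch).reshape(-1, patch * patch)
             for s in normal_specs]
    return np.concatenate(feats, axis=0)

def anomaly_map(spec, memory, patch=8):
    """Score each patch by its distance to the nearest normal embedding."""
    emb = patch_embeddings(spec, patch)
    n_f, n_t, dim = emb.shape
    flat = emb.reshape(-1, dim)
    # Pairwise squared Euclidean distances: (test patches, memory entries)
    d2 = ((flat[:, None, :] - memory[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1)).reshape(n_f, n_t)

# Synthetic data in place of real spectrograms (illustration only).
rng = np.random.default_rng(0)
normal = [rng.normal(0.0, 1.0, (64, 64)) for _ in range(4)]
memory = fit_memory_bank(normal)

test_spec = rng.normal(0.0, 1.0, (64, 64))
test_spec[16:24, 40:48] += 6.0  # inject an anomaly into one time-frequency patch

heat = anomaly_map(test_spec, memory)
peak = tuple(int(i) for i in np.unravel_index(heat.argmax(), heat.shape))
clip_score = float(heat.max())  # clip-level score as the heatmap maximum
```

In this sketch the heatmap has one cell per patch, so `peak` gives the (frequency, time) patch index of the strongest anomaly, and taking the maximum as the clip-level score is one common convention rather than something the summary specifies.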
Abstract
Recent advances in Visual Anomaly Detection (VAD) have introduced sophisticated algorithms leveraging embeddings generated by pre-trained feature extractors. Inspired by these developments, we investigate the adaptation of such algorithms to the audio domain to address the problem of Audio Anomaly Detection (AAD). Unlike most existing AAD methods, which primarily classify anomalous samples, our approach introduces fine-grained time-frequency localization of anomalies within the spectrogram, significantly improving explainability. This capability enables a more precise understanding of where and when anomalies occur, making the results more actionable for end users. We evaluate our approach on industrial and environmental benchmarks, demonstrating the effectiveness of VAD techniques in detecting anomalies in audio signals. Moreover, these techniques improve explainability by enabling localized anomaly identification, making audio anomaly detection systems more interpretable and practical.
