From Vision to Sound: Advancing Audio Anomaly Detection with Vision-Based Algorithms

Manuel Barusco, Francesco Borsatti, Davide Dalle Pezze, Francesco Paissan, Elisabetta Farella, Gian Antonio Susto

TL;DR

This paper tackles audio anomaly detection (AAD) by transferring Visual Anomaly Detection (VAD) methods to the audio domain, with a focus on explainable, fine-grained localization in spectrograms. A spectrogram is embedded by a pre-trained audio encoder (e.g., CNN14 from CLAP), and a VAD method then produces an anomaly heatmap over the time-frequency plane, all in an unsupervised setting. The authors introduce new evaluation metrics for spectrogram localization and assess four VAD approaches (PatchCore, PaDiM, CFA, STFPM) on industrial (MIMII) and environmental (EnvMix) benchmarks across varying signal-to-noise ratios (SNRs). The results show improved explainability through anomaly maps and demonstrate cross-domain potential, while highlighting the need for better faithfulness metrics and the impact of the feature extractor on performance.
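
To make the pipeline above concrete, here is a minimal sketch of the nearest-neighbor memory-bank idea behind PatchCore, applied to a spectrogram. It is not the authors' implementation. As a loudly stated assumption, the pre-trained encoder (e.g., CNN14) is replaced by a hypothetical stand-in, `patch_embeddings`, that simply flattens non-overlapping spectrogram patches, and the coreset subsampling step of PatchCore is omitted. All function names and the synthetic data are illustrative only.

```python
import numpy as np

def patch_embeddings(spec, patch=(8, 8)):
    """Hypothetical stand-in for a pre-trained encoder (assumption):
    cut the spectrogram into non-overlapping patches and flatten each one."""
    pf, pt = patch
    gf, gt = spec.shape[0] // pf, spec.shape[1] // pt
    spec = spec[: gf * pf, : gt * pt]  # drop any ragged border
    patches = spec.reshape(gf, pf, gt, pt).transpose(0, 2, 1, 3)
    return patches.reshape(gf * gt, pf * pt), (gf, gt)

def build_memory_bank(normal_specs, patch=(8, 8)):
    """Collect patch embeddings from normal-only (unsupervised) training data."""
    return np.concatenate([patch_embeddings(s, patch)[0] for s in normal_specs])

def anomaly_map(spec, bank, patch=(8, 8)):
    """Score each patch by its distance to the nearest normal patch."""
    emb, grid = patch_embeddings(spec, patch)
    # Pairwise squared Euclidean distances, shape (n_test_patches, n_bank)
    d2 = (emb**2).sum(1)[:, None] - 2.0 * emb @ bank.T + (bank**2).sum(1)[None, :]
    nearest = np.sqrt(np.maximum(d2, 0.0)).min(axis=1)
    return nearest.reshape(grid)  # coarse heatmap over (frequency, time)

# Synthetic stand-ins for log-mel spectrograms: 64 frequency bins x 128 frames
rng = np.random.default_rng(0)
normal_specs = [rng.normal(0.0, 1.0, (64, 128)) for _ in range(4)]
bank = build_memory_bank(normal_specs)

test_spec = rng.normal(0.0, 1.0, (64, 128))
test_spec[16:24, 40:48] += 10.0  # inject a localized synthetic anomaly

amap = anomaly_map(test_spec, bank)
peak = tuple(int(i) for i in np.unravel_index(amap.argmax(), amap.shape))
clip_score = float(amap.max())  # one common choice of clip-level score
print(amap.shape, peak)
```

The heatmap has one cell per patch, so its resolution is set by the patch size here; with a real encoder it would be set by the spatial size of the chosen feature map and then upsampled to the spectrogram.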

Abstract

Recent advances in Visual Anomaly Detection (VAD) have introduced sophisticated algorithms leveraging embeddings generated by pre-trained feature extractors. Inspired by these developments, we investigate the adaptation of such algorithms to the audio domain to address the problem of Audio Anomaly Detection (AAD). Unlike most existing AAD methods, which primarily classify anomalous samples, our approach introduces fine-grained time-frequency localization of anomalies within the spectrogram, significantly improving explainability. This capability enables a more precise understanding of where and when anomalies occur, making the results more actionable for end users. We evaluate our approach on industrial and environmental benchmarks, demonstrating the effectiveness of VAD techniques in detecting anomalies in audio signals. Moreover, these techniques improve explainability by enabling localized anomaly identification, making audio anomaly detection systems more interpretable and practical.

Paper Structure

This paper contains 16 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Example of interpretability results obtained by our approaches. The first image illustrates a mixed (corrupted) audio signal containing both normal and anomalous components. The second image displays the spectrogram of the anomaly alone. The third image shows the anomaly map generated by the model, highlighting its ability to accurately identify and isolate the anomalous segments within the mixed signal.
  • Figure 2: We propose to work on the embeddings produced by the audio feature extractor and to test several state-of-the-art algorithms originally proposed for Visual Anomaly Detection.
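
The mixed signal in Figure 1 combines a normal recording with an anomalous one, and the benchmarks are evaluated at several SNRs. The summary does not say how the mixtures are built, so the following is only a sketch under an assumed convention: the normal audio is treated as the "signal" and the anomaly as the "noise", and the anomaly is rescaled to hit a target SNR in dB. The function name `mix_at_snr` and the synthetic waveforms are hypothetical.

```python
import numpy as np

def mix_at_snr(normal, anomaly, snr_db):
    """Add `anomaly` to `normal` after scaling it so that
    10 * log10(P_normal / P_scaled_anomaly) == snr_db (assumed convention)."""
    p_normal = np.mean(normal**2)
    p_anomaly = np.mean(anomaly**2)
    gain = np.sqrt(p_normal / (p_anomaly * 10.0 ** (snr_db / 10.0)))
    return normal + gain * anomaly, gain

# Synthetic one-second waveforms at 16 kHz (illustrative data only)
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, 16000)
anomaly = rng.normal(0.0, 0.3, 16000)

mixed, gain = mix_at_snr(normal, anomaly, snr_db=6.0)
achieved = 10.0 * np.log10(np.mean(normal**2) / np.mean((gain * anomaly) ** 2))
print(round(float(achieved), 6))
```

Under this convention a lower SNR means a louder anomaly relative to the normal sound; the opposite convention simply swaps the roles of the two inputs.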