AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan
TL;DR
This work tackles machine anomalous sound detection (ASD) under the constraint of training only on normal sounds. It introduces AnoPatch, a Vision Transformer backbone pre-trained on AudioSet and fine-tuned on machine audio with metadata via ArcFace loss, using patch-level mel-spectrogram representations and ECAPA-TDNN pooling to produce robust utterance embeddings. Anomaly scores are computed with a KNN-based backend leveraging a memory bank of normal embeddings, enabling strong performance without model ensembling. Across the DCASE 2020 and 2023 machine ASD benchmarks, AnoPatch achieves state-of-the-art results and provides empirical evidence that better consistency between pre-training data, model architecture, and fine-tuning tasks yields substantial gains for machine ASD.
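The KNN backend mentioned above can be illustrated with a minimal sketch. It assumes cosine distance, a small `k`, and averaging over the `k` nearest neighbours; none of these choices is stated in this summary, and the function names `l2_normalize` and `knn_anomaly_scores` are hypothetical. The idea is only that a test embedding far from every stored normal embedding receives a high anomaly score.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def knn_anomaly_scores(memory_bank, queries, k=1):
    """Mean cosine distance from each query to its k nearest normal embeddings.

    memory_bank: (n_bank, dim) embeddings of normal training clips.
    queries:     (n_queries, dim) embeddings of test clips.
    The distance metric and the averaging over k neighbours are assumptions
    made for illustration, not details confirmed by the paper summary.
    """
    bank = l2_normalize(np.asarray(memory_bank, dtype=np.float64))
    q = l2_normalize(np.asarray(queries, dtype=np.float64))
    dist = 1.0 - q @ bank.T                  # (n_queries, n_bank) cosine distances
    nearest = np.sort(dist, axis=1)[:, :k]   # k smallest distances per query
    return nearest.mean(axis=1)

# Toy example: the first query matches a stored normal embedding, the second does not.
bank = np.array([[1.0, 0.0], [0.0, 1.0]])
queries = np.array([[1.0, 0.0], [-1.0, 0.0]])
scores = knn_anomaly_scores(bank, queries, k=1)
```

In this toy case the second query receives the larger score because no stored normal embedding points in its direction.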
Abstract
Large pre-trained models have demonstrated dominant performance in multiple areas, where consistency between pre-training and fine-tuning is the key to success. However, few works have reported satisfactory results with pre-trained models on the machine anomalous sound detection (ASD) task. This may be caused by a mismatch between the pre-trained model and the inductive bias of machine audio, which leads to inconsistency in both data and architecture. We therefore propose AnoPatch, which uses a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. We believe that machine audio is more closely related to general audio datasets than to speech datasets, and that modeling it at the patch level suits the sparsity of machine audio. As a result, AnoPatch achieves state-of-the-art (SOTA) performance on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.
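To make "patch level" concrete, the following sketch splits a mel-spectrogram into ViT-style patches. The 16 x 16 patch size, the non-overlapping layout, the 128 mel bins, and the function name `patchify` are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np

def patchify(spec, patch_f=16, patch_t=16):
    """Split a (n_mels, n_frames) spectrogram into flattened non-overlapping patches.

    Returns an array of shape (n_patches, patch_f * patch_t), ordered with the
    frequency-block index varying slowest and the time-block index fastest.
    Trailing rows or columns that do not fill a whole patch are dropped.
    """
    spec = np.asarray(spec)
    n_mels, n_frames = spec.shape
    f = n_mels // patch_f      # number of patches along frequency
    t = n_frames // patch_t    # number of patches along time
    spec = spec[: f * patch_f, : t * patch_t]
    patches = spec.reshape(f, patch_f, t, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(f * t, patch_f * patch_t)

# Hypothetical input: 128 mel bins by 100 frames.
spec = np.arange(128 * 100, dtype=np.float32).reshape(128, 100)
patches = patchify(spec)
```

Each row of `patches` would then be linearly projected into a token for the Transformer, so every token covers a small time-frequency region rather than a full frame.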
