Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models
Xinhu Zheng, Anbai Jiang, Bing Han, Yanmin Qian, Pingyi Fan, Jia Liu, Wei-Qiang Zhang
TL;DR
This work tackles anomalous sound detection (ASD) under domain shift and limited labeled data in industrial settings by fine-tuning audio pre-trained models efficiently with Low-Rank Adaptation (LoRA). A two-stage pipeline pairs a front-end, which extracts semantic embeddings from the pre-trained model, with a back-end KNN-based anomaly detector. The front-end is trained with ArcFace loss to produce discriminative embeddings and is augmented with SpecAugment where applicable. Through extensive ablations, the authors show that audio pre-trained models outperform speech pre-trained ones, and that LoRA tuning (best at rank $r=64$ and applied across all Transformer layers) generalizes best. The final system reaches a harmonic score of 77.75% on DCASE 2023 Task 2, an improvement of 6.48% over the prior SOTA. The results demonstrate the practical viability of combining audio pre-trained models with LoRA adaptation to improve ASD performance in real-world, data-scarce scenarios.
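To make the back-end stage concrete, here is a minimal sketch of KNN-based anomaly scoring over embeddings. It is an illustration under stated assumptions, not the authors' implementation: the cosine distance, the value of `k`, and the function names `l2_normalize` and `knn_anomaly_scores` are all hypothetical choices that this summary does not specify.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def knn_anomaly_scores(train_emb, test_emb, k=1):
    """Score each test embedding by its mean cosine distance to its k nearest
    normal (training) embeddings. Higher score means more anomalous.

    Assumption: cosine distance and mean aggregation are illustrative choices.
    """
    train = l2_normalize(np.asarray(train_emb, dtype=np.float64))
    test = l2_normalize(np.asarray(test_emb, dtype=np.float64))
    cos_dist = 1.0 - test @ train.T            # shape: (n_test, n_train)
    k = min(k, train.shape[0])                 # guard against k > n_train
    # After partitioning, the first k columns hold the k smallest distances.
    nearest = np.partition(cos_dist, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


# Toy embeddings: the reference set stands in for normal machine sounds.
normal_reference = np.array([[1.0, 0.0], [0.9, 0.1]])
queries = np.array([[1.0, 0.05],   # close to the normal cluster
                    [0.0, 1.0]])   # far from the normal cluster
scores = knn_anomaly_scores(normal_reference, queries, k=1)
```

In this toy case the second query points away from both reference embeddings, so it receives the larger score. In practice the embeddings would come from the fine-tuned front-end rather than hand-written vectors.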
Abstract
Anomalous Sound Detection (ASD) has gained significant interest through the application of various Artificial Intelligence (AI) technologies in industrial settings. Despite their great potential, ASD systems are difficult to deploy at real production sites because they generalize poorly, a problem caused primarily by the difficulty of data collection and the complexity of environmental factors. This paper introduces a robust ASD model that leverages audio pre-trained models. Specifically, we fine-tune these models on machine operation data, employing SpecAug as a data augmentation strategy. Additionally, we investigate the impact of using Low-Rank Adaptation (LoRA) tuning instead of full fine-tuning to address the limited amount of data available for fine-tuning. Our experiments on the DCASE2023 Task 2 dataset establish a new benchmark of 77.75% on the evaluation set, a significant improvement of 6.48% over previous state-of-the-art (SOTA) models, including top-tier traditional convolutional networks and speech pre-trained models. This result demonstrates the effectiveness of audio pre-trained models with LoRA tuning. Ablation studies are also conducted to showcase the efficacy of the proposed scheme.
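As a rough illustration of why LoRA suits limited fine-tuning data, the sketch below shows the generic LoRA idea for one linear layer: the pre-trained weight stays frozen and only a low-rank update is trained. This is a NumPy sketch of the general technique, not the authors' code. The class name `LoRALinear`, the hidden size of 768, the scaling factor `alpha`, and the initialization scale are assumptions made for illustration; only the rank of 64 is taken from the summary above.

```python
import numpy as np


class LoRALinear:
    """Linear layer with a frozen weight W and a trainable low-rank update.

    Effective weight: W + (alpha / rank) * B @ A, where A is (rank, d_in)
    and B is (d_out, rank). Only A and B would be updated during tuning.
    """

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = np.asarray(weight, dtype=np.float64)  # frozen, (d_out, d_in)
        d_out, d_in = self.weight.shape
        # A starts small and random, B starts at zero, so the layer initially
        # reproduces the pre-trained output exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size


# Hypothetical sizes: a 768 x 768 projection, as found in many Transformer encoders.
hidden = 768
pretrained_w = np.random.default_rng(1).normal(size=(hidden, hidden))
layer = LoRALinear(pretrained_w, rank=64)
x = np.random.default_rng(2).normal(size=(4, hidden))
y = layer.forward(x)
```

With these assumed sizes the adapter trains 2 × 64 × 768 = 98,304 parameters for this layer, against 589,824 in the frozen weight, which is the kind of reduction that makes tuning on small datasets less prone to overfitting.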
