Retaining Mixture Representations for Domain Generalized Anomalous Sound Detection
Phurich Saengthong, Tomoya Nishida, Kota Dohi, Natsuo Yamashita, Yohei Kawaguchi
TL;DR
This paper addresses anomalous sound detection (ASD) under distribution shifts by fixing a key limitation of training-free self-supervised learning (SSL) backbones: their mixture representations can degrade in noisy conditions. It introduces a retain-not-denoise pretraining strategy that combines a multi-label tagging objective with a mixture alignment objective, which matches a student encoder's mixture embeddings to convex combinations of teacher embeddings derived from clean and noise sources, with a fixed mix ratio $\lambda = 0.5$. Empirical results show that this approach improves robustness across stationary, non-stationary, and mismatched noise conditions, outperforming denoising baselines and achieving notable gains at low SNRs; domain-matched pretraining data and layer-sensitive representations further improve performance. The findings highlight the importance of preserving full mixture information and demonstrate the practicality of training-free ASD with improved generalization to real-world, noisy mixtures.
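The mixture alignment objective can be illustrated with a minimal sketch. The summary states only that student mixture embeddings are matched to a convex combination of teacher embeddings with $\lambda = 0.5$; the cosine distance used here, the function names, and the choice of which source $\lambda$ weights are assumptions made for illustration, not the authors' implementation. Embeddings are plain Python lists so the example runs with the standard library alone.

```python
import math


def convex_target(z_teacher_clean, z_teacher_noise, lam=0.5):
    """Convex combination of teacher embeddings for the clean and noise sources.

    Assumption: lam weights the clean source and (1 - lam) the noise source.
    With lam = 0.5 the two conventions coincide.
    """
    return [lam * c + (1.0 - lam) * n
            for c, n in zip(z_teacher_clean, z_teacher_noise)]


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity; the distance choice is an assumption."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + eps)


def mixture_alignment_loss(z_student_mix, z_teacher_clean, z_teacher_noise, lam=0.5):
    """Distance between the student's mixture embedding and the convex teacher target."""
    target = convex_target(z_teacher_clean, z_teacher_noise, lam)
    return cosine_distance(z_student_mix, target)


# Toy 2-D embeddings (hypothetical values).
loss_aligned = mixture_alignment_loss([0.5, 0.5], [1.0, 0.0], [0.0, 1.0])
loss_orthogonal = mixture_alignment_loss([1.0, -1.0], [1.0, 0.0], [0.0, 1.0])
```

In this toy case the loss is near zero when the student embedding retains both sources equally and grows as it moves away from the convex target, which is the retain-not-denoise behavior the objective is meant to encourage.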
Abstract
Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR input mixtures of machine and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine-tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex combinations of teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.
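The nearest-neighbor detection step mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming a bank of embeddings from normal recordings and a cosine-distance score averaged over the k nearest neighbors; the distance metric, the value of k, and the function names are assumptions, since the abstract does not specify them.

```python
import math


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + eps)


def knn_anomaly_score(query, reference_bank, k=1):
    """Mean distance from a query embedding to its k nearest reference embeddings.

    reference_bank holds embeddings of normal sounds (from a frozen encoder
    in the training-free setting). Higher scores indicate more anomalous inputs.
    """
    if not reference_bank:
        raise ValueError("reference_bank must contain at least one embedding")
    distances = sorted(cosine_distance(query, ref) for ref in reference_bank)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy 2-D embeddings (hypothetical values).
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
score_normal = knn_anomaly_score([1.0, 0.05], bank, k=2)
score_anomalous = knn_anomaly_score([-1.0, -1.0], bank, k=2)
```

Because the score depends only on distances in embedding space, no detector training is required, and a mixture embedding that drifts away from the reference bank under unseen noise raises the score of normal inputs, which is the failure mode the proposed pretraining targets.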
