Retaining Mixture Representations for Domain Generalized Anomalous Sound Detection

Phurich Saengthong, Tomoya Nishida, Kota Dohi, Natsuo Yamashita, Yohei Kawaguchi

TL;DR

This paper addresses anomalous sound detection under distribution shifts by fixing a key limitation of training-free SSL backbones: their mixture representations can degrade in noisy conditions. It introduces a retain-not-denoise pre-training strategy that combines a multi-label tagging objective with a mixture alignment objective, matching a student encoder's mixture embeddings to convex teacher embeddings derived from clean and noise sources, with a fixed mix ratio $\lambda = 0.5$. Empirical results show that this approach improves robustness across stationary, non-stationary, and mismatched noise conditions, outperforming denoising baselines and achieving notable gains at low SNRs; domain-matched pre-training data and layer-sensitive representations further bolster performance. The findings highlight the importance of preserving full mixture information and demonstrate the practicality of training-free ASD with improved generalization to real-world, noisy mixtures.
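
The mixture alignment target described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the mean-squared-error distance, and the random toy embeddings are all assumptions. Only the convex combination of clean and noise teacher embeddings with $\lambda = 0.5$ comes from the summary.

```python
import numpy as np


def convex_teacher_target(z_clean, z_noise, lam=0.5):
    """Convex combination of teacher embeddings for the clean and noise sources."""
    return lam * z_clean + (1.0 - lam) * z_noise


def mixture_alignment_loss(z_student_mix, z_clean, z_noise, lam=0.5):
    """Mean squared error between the student's mixture embedding and the target.

    The MSE distance is an assumption made for illustration; the paper's
    actual distance function is not stated in this summary.
    """
    target = convex_teacher_target(z_clean, z_noise, lam)
    return float(np.mean((z_student_mix - target) ** 2))


rng = np.random.default_rng(0)
z_clean = rng.normal(size=(4, 8))  # hypothetical teacher embeddings of clean machine sounds
z_noise = rng.normal(size=(4, 8))  # hypothetical teacher embeddings of noise-only inputs

# A student whose mixture embedding equals the convex target incurs zero loss.
z_student = convex_teacher_target(z_clean, z_noise)
loss_perfect = mixture_alignment_loss(z_student, z_clean, z_noise)

# A student that "denoises" (returns the clean embedding) is penalized.
loss_denoised = mixture_alignment_loss(z_clean, z_clean, z_noise)
```

Under these assumptions, a student that discards the noise component and reproduces only the clean embedding incurs a non-zero loss, which is one way to read the retain-not-denoise idea.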

Abstract

Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR input mixtures of machine and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine-tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.
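
The nearest-neighbor detection step mentioned in the abstract can be illustrated with a small sketch. It assumes cosine distance and a mean over the `k` nearest reference embeddings; the function names, the choice of distance, and the toy two-dimensional vectors are hypothetical and stand in for embeddings from a frozen encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def knn_anomaly_scores(test_emb, ref_emb, k=1):
    """Mean cosine distance from each test embedding to its k nearest references."""
    sims = l2_normalize(test_emb) @ l2_normalize(ref_emb).T
    dists = 1.0 - sims
    nearest = np.sort(dists, axis=1)[:, :k]  # k smallest distances per test row
    return nearest.mean(axis=1)


# Toy reference bank of "normal" embeddings (hypothetical values).
ref = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]])
# First test vector matches a reference; the second points elsewhere.
test = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = knn_anomaly_scores(test, ref, k=1)
```

A higher score means the test embedding lies farther from every normal reference. In this setup, a shift in the mixture embedding caused by unseen noise raises the score even for a normal machine sound, which is the failure mode the abstract describes.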

Paper Structure

This paper contains 10 sections, 7 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of the proposed approach.
  • Figure 2: Comparison of GenRep [genrep] performance using the BEATs audio encoder [pmlr-v202-chen23ag] (blue), the denoising baseline (orange), and the proposed retention method (green) on the DCASE2023T2 evaluation set [dohi_description_2023] (left) and the DCASE2025T2 evaluation set [Nishida_arXiv2025_01] (right).