Disentangling Hierarchical Features for Anomalous Sound Detection Under Domain Shift

Jian Guan; Jiantong Tian; Qiaoxi Zhu; Feiyang Xiao; Hejing Zhang; Xubo Liu

TL;DR

The paper tackles anomalous sound detection (ASD) under multi-domain shift and introduces Gradient Reversal-based Hierarchical feature Disentanglement (GRHD). GRHD combines a gradient reversal classifier to extract coarse domain-unrelated features $z_{rev}$ with a hierarchical metadata constraint that learns fine-grained domain-related features $z_{sec}$ and $z_{att}$, optimized by $L_{total} = \alpha L_{rev} + \beta L_{sec} + \gamma L_{att}$. Through adversarial learning and hierarchical constraints, GRHD achieves clearer separation of domain-related versus domain-unrelated features, improving ASD performance under domain shift. Evaluated on the DCASE 2022 Task 2 dataset, GRHD attains state-of-the-art HAUC, validating the effectiveness of both the gradient reversal mechanism and hierarchical metadata guidance for robust ASD in real-world, shifting environments.
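
To make the weighted objective $L_{total} = \alpha L_{rev} + \beta L_{sec} + \gamma L_{att}$ concrete, here is a minimal sketch in plain Python. It assumes each term is a softmax cross-entropy over classifier logits, which the summary does not state. The function names, the default weights, and the example logits are hypothetical and chosen only for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (assumed form of each loss term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def total_loss(rev, sec, att, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum alpha*L_rev + beta*L_sec + gamma*L_att.

    Each of rev, sec, att is a (logits, label) pair for the corresponding
    classifier head. The weights are hypothetical hyperparameters.
    """
    l_rev = cross_entropy(*rev)
    l_sec = cross_entropy(*sec)
    l_att = cross_entropy(*att)
    return alpha * l_rev + beta * l_sec + gamma * l_att


# Illustrative logits only; not values from the paper.
example = total_loss(
    rev=([0.0, 0.0], 0),
    sec=([0.0, 0.0, 0.0], 1),
    att=([0.0, 0.0, 0.0, 0.0], 2),
)
```

With uniform logits each term reduces to the log of the number of classes, so the weights alone determine how much each head contributes to the total.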

Abstract

Anomalous sound detection (ASD) encounters difficulties with domain shift, where the sounds of machines in target domains differ significantly from those in source domains due to varying operating conditions. Existing methods typically employ domain classifiers to enhance detection performance, but they often overlook the influence of domain-unrelated information. This oversight can hinder the model's ability to clearly distinguish between domains, thereby weakening its capacity to differentiate normal from abnormal sounds. In this paper, we propose a Gradient Reversal-based Hierarchical feature Disentanglement (GRHD) method to address the above challenge. GRHD uses gradient reversal to separate domain-related features from domain-unrelated ones, resulting in more robust feature representations. Additionally, the method employs a hierarchical structure to guide the learning of fine-grained, domain-specific features by leveraging available metadata, such as section IDs and machine sound attributes. Experimental results on the DCASE 2022 Challenge Task 2 dataset demonstrate that the proposed method significantly improves ASD performance under domain shift.
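
The gradient reversal idea mentioned above can be illustrated with a scalar toy model. This is a generic sketch of gradient reversal, not the authors' implementation: the one-weight encoder `w`, the one-weight domain classifier `v`, the binary domain label, and the scale `lam` are all assumptions made for illustration. The reversal step is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so the classifier is trained to predict the domain while the encoder is pushed in the opposite direction.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def grl_forward(z):
    """Gradient reversal leaves the feature unchanged in the forward pass."""
    return z


def grl_backward(grad_out, lam=1.0):
    """In the backward pass the incoming gradient is negated and scaled."""
    return -lam * grad_out


def domain_step(x, w, v, label, lam=1.0):
    """One forward/backward pass of a toy encoder + domain classifier.

    x: scalar input, w: encoder weight, v: classifier weight,
    label: binary domain label (0 or 1). All names are hypothetical.
    """
    # Forward pass
    z = w * x                      # encoder feature
    z_rev = grl_forward(z)         # identity
    p = sigmoid(v * z_rev)         # predicted probability of domain 1
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

    # Backward pass (manual chain rule)
    d_logit = p - label            # d loss / d logit for sigmoid + BCE
    grad_v = d_logit * z_rev       # classifier receives the ordinary gradient
    grad_z = grl_backward(d_logit * v, lam)  # sign flipped before the encoder
    grad_w = grad_z * x
    return loss, grad_v, grad_w


loss, grad_v, grad_w = domain_step(x=2.0, w=0.5, v=1.0, label=1)
```

In this example `grad_v` is negative and `grad_w` is positive: a gradient descent step increases `v` (better domain prediction) while moving `w` in the direction that raises the domain loss, which is the adversarial effect gradient reversal is meant to produce.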

Paper Structure (14 sections, 6 equations, 3 figures, 2 tables)

Introduction
Proposed Method
    Gradient Reversal Based Feature Disentanglement
    Hierarchical Metadata Constrained Domain-Related Feature Learning
Experiment and Result
    Experimental Setup
        Dataset
        Implementation
        Performance Metrics
    Experimental Results
        Performance Comparison
        Ablation Study
        Visualization Analysis
Conclusion

Figures (3)

Figure 1: Illustrations of (a) the hierarchical metadata structure accompanying the audio data for anomalous sound detection, and (b) the latent space of the audio data, containing both domain-related and domain-unrelated features. For example, S0_AG1 corresponds to the latent space of audio labeled by section 00 and attribute group 1.
Figure 2: The proposed GRHD method's training process. The gradient reversal classifier $GRC(\cdot)$ is used for feature disentanglement, and hierarchical metadata guides fine-grained, domain-specific audio feature learning. $\textbf{z}_{rev}$ represents the disentangled domain-related feature, while $\textbf{z}_{sec}$ and $\textbf{z}_{att}$ correspond to section ID and attribute group features. $\mathcal{C}_{sec}$ and $\mathcal{C}_{att}$ are the classifiers for section ID and attribute group, producing labels $\hat{l}_{sec}$ and $\hat{l}_{att}$. The actual labels accompanying the audio data are $l_{sec}$ and $l_{att}$. The predicted label $\hat{l}_{rev}$ is from $GRC(\cdot)$.
Figure 3: t-SNE visualization of the latent features with and without the gradient reversal strategy for machine type ToyCar. Different colours represent different section IDs. “$\bullet$” and “$\times$” represent normal and anomalous sounds, respectively.
