Sub-Band Spectral Matching with Localized Score Aggregation for Robust Anomalous Sound Detection

Phurich Saengthong; Takahiro Shinozaki

Abstract

Detecting subtle deviations in noisy acoustic environments is central to anomalous sound detection (ASD). A common training-free ASD pipeline temporally pools frame-level representations into a band-preserving feature vector and scores anomalies using a single nearest-neighbor match. However, this global matching can inflate normal-score variance through two effects. First, when normal sounds exhibit band-wise variability, a single global neighbor forces all bands to share the same reference, increasing band-level mismatch. Second, cosine-based matching is energy-coupled, allowing a few high-energy bands to dominate score computation under normal energy fluctuations and further increase variance. We propose BEAM, which stores temporally pooled sub-band vectors in a memory bank, retrieves neighbors per sub-band, and uniformly aggregates scores to reduce normal-score variability and improve discriminability. We further introduce a parameter-free adaptive fusion to better handle diverse temporal dynamics in sub-band responses. Experiments on multiple DCASE Task 2 benchmarks show strong performance without task-specific training, robustness to noise and domain shifts, and complementary gains when combined with encoder fine-tuning.
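
The contrast between tied-reference global matching and per-sub-band matching described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the contiguous equal-width band slicing, the `band_size` parameter, and the use of cosine distance for the sub-band score are assumptions made for illustration only.

```python
import numpy as np

def cosine_distance(query, bank):
    """Cosine distance between one query vector (D,) and each row of bank (N, D)."""
    q = query / (np.linalg.norm(query) + 1e-12)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-12)
    return 1.0 - b @ q

def global_score(query, bank):
    """Tied-reference baseline: a single nearest neighbor for the full band vector."""
    return float(cosine_distance(query, bank).min())

def subband_score(query, bank, band_size):
    """Retrieve a nearest neighbor per sub-band, then average the band scores uniformly.

    Assumption: bands are contiguous, equal-width slices; any trailing
    dimensions that do not fill a whole band are ignored.
    """
    n_bands = query.shape[0] // band_size
    scores = []
    for b in range(n_bands):
        sl = slice(b * band_size, (b + 1) * band_size)
        scores.append(cosine_distance(query[sl], bank[:, sl]).min())
    return float(np.mean(scores))

# Toy memory bank of two "normal" clips with two sub-bands of width 2 (hypothetical data).
bank = np.array([[1.0, 0.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0, 0.0]])
# The query combines band 0 of the first clip with band 1 of the second clip.
query = np.array([1.0, 0.0, 1.0, 0.0])

g = global_score(query, bank)
s = subband_score(query, bank, band_size=2)
```

In this toy case each query sub-band has an exact match in the memory bank, but in different stored clips, so the sub-band score is near zero while the global score stays at 0.5 because no single clip matches both bands at once.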

Paper Structure (40 sections, 2 theorems, 57 equations, 5 figures, 7 tables)

Introduction
Preliminary: Global Band Matching
Clip-Level Representations
Temporally pooled spectral and cepstral features
Spectrogram embeddings from pretrained encoders
LPC spectrum
Tied-Reference Global Matching
Global Band Cosine Decomposition
Method
Memory Bank of Sub-Band Embeddings
Local Matching and Clip-Level Aggregation
Dynamic Mean--Max Integration (DMM)
Theoretical Analysis
Related Work
Experimental Details
...and 25 more sections

Key Result

Proposition 1

Under the normal class $\mathcal{N}$, we upper-bound $\mathrm{Var}(S_{\mathrm{sub}}\mid\mathcal{N})$ by a constant multiple of $\mathrm{Var}(S_{\mathrm{glob}}\mid\mathcal{N})$. The constant separates the effects of (i) per-band template selection versus a tied global reference and (ii) uniform aggregation … By construction, $\lambda\ge 0$ and $-2\,\mathrm{Cov}(S_{\mathrm{sub}},P\mid\mathcal{N})$ …

Figures (5)

Figure 1: Overview of BEAM and AdaBEAM. Normal audio is mapped to handcrafted or deep features, sliced into sub-bands, and stored in a shared memory bank. At test time, each query sub-band is matched within its band-aligned memory, and the resulting band scores are uniformly aggregated to produce the anomaly score. AdaBEAM adds Dynamic Mean--Max (DMM) fusion by building mean- and max-pooled sub-band memories, scoring both views, and combining the resulting clip scores with a simple parameter-free rule.
Figure 2: Quantitative results for the theoretical analysis using BEATs iter3 features, averaged across machine types for each benchmark (DCASE2020T2, DCASE2023T2, and DCASE2024T2). (Left) Variance ratio (no LDN). (Center-left) Mean gap ratio (no LDN). (Center-right) $d'$ ratio (no LDN). (Right) $d'$ ratio with Local Density Normalization (LDN), including additional results with DMM.
Figure 3: Comparison of sub-band window sizes of BEAM on handcrafted and deep features (BEATs) in terms of AUC across DCASE2020T2, DCASE2023T2, and DCASE2024T2. Window size is reported as a fraction of the feature length along the windowed axis. For handcrafted features, we use non-overlapping windows for 10%--40% (stride equals window length); for 50%--90%, we use a two-window full-coverage rule with an end-aligned final window (equivalently, stride $=F-C$), which may overlap with the first window. For BEATs, windows are formed by grouping a fixed number of patch embeddings (8 patches total), thus only window sizes aligned to the patch grid are meaningful. Dotted horizontal lines show the baseline Tmean result.
Figure 4: Effect of varying $K$ in sub-band density normalization on Official Scores across DCASE2020T2, DCASE2023T2, and DCASE2024T2. Results are shown for Tmean, Tmax, and DMM variants using BEATs features.
Figure 5: Comparison of features on the MIMII dataset.
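
The windowing rule stated in the Figure 3 caption for handcrafted features can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name `subband_windows`, the rounding of the window length, the 0.5 threshold separating the two regimes, and dropping any trailing remainder in the non-overlapping case are all assumptions. `F` is the feature length along the windowed axis and `C` the window length, as in the caption's stride $=F-C$.

```python
def subband_windows(F, frac):
    """Return (start, end) index pairs for sub-band windows along an axis of length F.

    frac is the window size as a fraction of F. Assumptions: the window
    length is round(F * frac), fractions below 0.5 use non-overlapping
    windows, and any remainder shorter than a full window is dropped.
    """
    C = max(1, round(F * frac))
    if frac < 0.5:
        # Non-overlapping windows: stride equals window length.
        starts = list(range(0, F - C + 1, C))
    else:
        # Two windows with full coverage: the second is end-aligned,
        # so the stride is F - C and the windows may overlap.
        starts = [0, F - C]
    return [(s, s + C) for s in starts]

small = subband_windows(128, 0.25)
large = subband_windows(128, 0.75)
```

With `F = 128`, a 25% window gives four disjoint windows of length 32, while a 75% window gives the two overlapping windows `(0, 96)` and `(32, 128)`.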

Theorems & Definitions (4)

Proposition 1: Normal-variance ratio from per-sub-band selection and global aggregation
proof
Theorem 1: Sufficient condition for $d'$ improvement
proof
