DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa

TL;DR

DHAuDS addresses domain shift in audio classification by introducing a unified benchmark that simulates realistic, dynamic, and heterogeneous acoustic degradations. It defines four datasets (UrbanSound8K-C, SpeechCommandsV2-C, VocalSound-C, and ReefSet-C), each with dynamic severity and mixed-noise conditions, and 14 evaluation criteria per dataset (8 for UrbanSound8K-C), giving 50 unrepeated criteria across 124 experiments. The methodology combines entropy-based TTA losses with a consistency loss over two temporally shifted views, and employs a binary learning-rate strategy to improve stability and performance during adaptation. Findings show consistent post-adaptation gains across datasets and models (HuBERT, AMAuT, CoNMix++), with insights into hyperparameter stability and trade-offs, while emphasizing DHAuDS’s reproducibility and real-world relevance for advancing robust, adaptive audio modeling.
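To make the loss described in the TL;DR concrete, the following is a minimal sketch of an entropy term combined with a consistency term over two temporally shifted views of the same clip. It is an illustration under stated assumptions, not the paper's implementation: the circular shift, the squared-difference consistency measure, the weight `lam`, and the toy linear classifier (`toy_logits`, `weights`) are all hypothetical choices made here for the example. The binary learning-rate strategy is not sketched because this summary does not define it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector (in nats)."""
    return -sum(p * math.log(p + eps) for p in probs)

def time_shift(wave, shift):
    """Circularly shift a waveform (list of samples) to the right.
    Assumption: the paper's temporal shift may be implemented differently."""
    shift %= len(wave)
    return wave[-shift:] + wave[:-shift] if shift else list(wave)

def toy_logits(wave, weights):
    """Hypothetical stand-in for a real audio model: one linear layer."""
    return [sum(w * x for w, x in zip(row, wave)) for row in weights]

def tta_loss(wave, weights, shift_a, shift_b, lam=1.0):
    """Entropy + consistency loss over two shifted views of one clip.
    Returns (total, entropy_term, consistency_term)."""
    p_a = softmax(toy_logits(time_shift(wave, shift_a), weights))
    p_b = softmax(toy_logits(time_shift(wave, shift_b), weights))
    # Entropy term: encourage confident predictions on both views.
    ent = 0.5 * (entropy(p_a) + entropy(p_b))
    # Consistency term (assumed form): squared difference between the views.
    cons = sum((a - b) ** 2 for a, b in zip(p_a, p_b))
    return ent + lam * cons, ent, cons

# Usage: a 4-sample "clip" and a 2-class toy classifier.
wave = [0.1, -0.3, 0.7, 0.2]
weights = [[1.0, 0.0, -1.0, 0.5], [-0.5, 1.0, 0.0, 1.0]]
total, ent, cons = tta_loss(wave, weights, shift_a=0, shift_b=1)
```

In a real test-time adaptation loop, this scalar would be minimized with respect to the model parameters on unlabeled target audio. The consistency term vanishes when both views yield identical predictions, so it only penalizes sensitivity to the shift.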

Abstract

Audio classifiers frequently face domain shift, in which models trained on one dataset lose accuracy on data recorded in acoustically different conditions. Previous Test-Time Adaptation (TTA) research in speech and sound analysis often evaluates models under fixed or mismatched noise settings that fail to mimic real-world variability. To overcome these limitations, this paper presents DHAuDS (Dynamic and Heterogeneous Audio Domain Shift), a benchmark designed to assess TTA approaches under more realistic and diverse acoustic shifts. DHAuDS comprises four standardized benchmarks: UrbanSound8K-C, SpeechCommandsV2-C, VocalSound-C, and ReefSet-C, each constructed with dynamic corruption severity levels and heterogeneous noise types to simulate authentic audio degradation scenarios. The framework defines 14 evaluation criteria for each benchmark (8 for UrbanSound8K-C), resulting in 50 unrepeated criteria (3 × 14 + 8 = 50) evaluated over 124 experiments, which collectively enable fair, reproducible, and cross-domain comparison of TTA algorithms. By including dynamic and mixed-domain noise settings, DHAuDS offers a consistent and publicly reproducible testbed to support ongoing studies in robust and adaptive audio modeling.

Paper Structure

This paper contains 29 sections, 7 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Comparison of ROC–AUC performance between high-momentum (HM = 0.90) and low-momentum (LM = 0.70) settings when performing AMAuT ENQ-L1 on RS-C. All other hyperparameters remain identical.
  • Figure 2: Comparison of ROC–AUC performance between a single learning rate (SLR) and a binary learning rate (BLR) strategy when performing AMAuT ENQ-L1 on RS-C. All other hyperparameters remain unchanged.
  • Figure 3: Prediction performance of CoNMix++ TST-L1 on RS-C during TTA. The model exhibits a consistent decline in performance even when using low-momentum and BLR strategies, indicating an abnormal negative adaptation effect.
  • Figure 4: Comparison of ROC–AUC performance between high-momentum (HM = 0.90) and low-momentum (LM = 0.75) settings when performing HuBERT ENQ-L1 on RS-C. All other hyperparameters remain identical.
  • Figure 5: Comparison of ROC–AUC performance between high-momentum (HM = 0.90) and low-momentum (LM = 0.75) settings when performing CoNMix++ ENQ-L1 on RS-C. All other hyperparameters remain identical.
  • ...and 2 more figures