Advancing Test-Time Adaptation in Wild Acoustic Test Settings

Hongfu Liu, Hengguan Huang, Ye Wang

TL;DR

This work proposes Confidence-Enhanced Adaptation, a novel TTA method for wild acoustic test settings. It performs frame-level adaptation with a confidence-aware weight scheme, so that essential information in high-entropy frames is not filtered out, and it applies consistency regularization during test-time optimization to leverage the inherent short-term consistency of speech signals.

Abstract

Acoustic foundation models fine-tuned for Automatic Speech Recognition (ASR) suffer from performance degradation when deployed in wild acoustic test settings. Stabilizing online Test-Time Adaptation (TTA) under these conditions remains an open question. Existing wild vision TTA methods often fail on speech data because they filter out high-entropy speech frames as unreliable, even when those frames contain crucial semantic content. Furthermore, unlike static vision data, speech signals exhibit short-term consistency, which calls for specialized adaptation strategies. In this work, we propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models. Our method, Confidence-Enhanced Adaptation, performs frame-level adaptation using a confidence-aware weight scheme to avoid filtering out essential information in high-entropy frames. Additionally, we apply consistency regularization during test-time optimization to leverage the inherent short-term consistency of speech signals. Our experiments on both synthetic and real-world datasets demonstrate that our approach outperforms existing baselines under various wild acoustic test settings, including Gaussian noise, environmental sounds, accent variations, and sung speech.

Paper Structure (39 sections, 5 equations, 3 figures, 29 tables)

Figures (3)

  • Figure 1: Robustness analysis of Wav2vec2 Base and Large under wild acoustic test settings, including 1) Noise (N): additive noises on the LibriSpeech test-other set; 2) Accent (A): accents of L2 learners on an L2-Arctic subset; 3) Singing (S): sung speech on the DSing test set. In-Domain (ID) indicates the performance on the LibriSpeech test-other set without additive noises. WER is short for Word Error Rate.
  • Figure 2: The overall framework of the proposed method, illustrated with a Connectionist Temporal Classification (CTC) based acoustic foundation model. The framework involves two steps. First, confidence-enhanced adaptation is performed to boost the reliability of noisy frames. Second, temporal consistency regularization is applied across the entire input sequence and jointly optimized with entropy minimization.
  • Figure 3: Frame-level entropy distribution in ASR fine-tuned acoustic foundation models. The entropy distributions are computed for Wav2vec2 Base models on the noise-corrupted LibriSpeech test-other and DSing test datasets across adaptation steps. We employ a threshold of $0.4 \times \ln C$, as recommended in EATA, where $C$ is the number of task classes. Frames with entropy above this threshold are highlighted in red as high-entropy (h) frames, while low-entropy (l) frames are marked in blue. We use $\bullet$ to denote non-silent (non-sil) frames and $\triangle$ to denote silent (sil) frames, taking the blank symbol as an approximate indicator of silence. Adaptation steps range from 0 to 9, and the results in each subfigure are averaged over 100 random samples.
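
The frame labelling described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `frame_entropies` and `label_frames`, the choice of index 0 for the blank symbol, and the rule "a frame is silent when its most probable class is the blank" are all assumptions made for this example. Only the $0.4 \times \ln C$ threshold and the use of the blank symbol as an approximate silence indicator come from the caption.

```python
import math


def frame_entropies(probs):
    """Shannon entropy (in nats) of each frame's class distribution."""
    return [-sum(p * math.log(p) for p in frame if p > 0.0) for frame in probs]


def label_frames(probs, blank_id=0, ratio=0.4):
    """Tag each frame as high/low entropy and silent/non-silent.

    probs:    list of per-frame probability lists, each of length C.
    blank_id: index of the CTC blank symbol (assumed to be 0 here).
    ratio:    threshold factor; 0.4 gives the 0.4 * ln(C) threshold.
    """
    num_classes = len(probs[0])
    threshold = ratio * math.log(num_classes)
    labels = []
    for frame, entropy in zip(probs, frame_entropies(probs)):
        # Assumption: the most probable class decides whether a frame is silent.
        predicted = max(range(num_classes), key=frame.__getitem__)
        labels.append({
            "entropy": entropy,
            "high_entropy": entropy > threshold,
            "silent": predicted == blank_id,
        })
    return labels


# Toy posteriors over C = 4 classes (hypothetical values, not real model output).
toy = [
    [0.97, 0.01, 0.01, 0.01],  # confident blank prediction
    [0.10, 0.30, 0.30, 0.30],  # uncertain non-blank prediction
    [0.02, 0.94, 0.02, 0.02],  # confident non-blank prediction
]
labels = label_frames(toy)
for i, info in enumerate(labels):
    print(i, round(info["entropy"], 3), info["high_entropy"], info["silent"])
```

In this toy input, only the second frame exceeds the threshold. A hard filter at this threshold would discard that frame even though it is non-silent, which is the situation the confidence-aware weighting is meant to address.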