Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech

Guan-Ting Lin, Wei-Ping Huang, Hung-yi Lee

TL;DR

This work introduces Dynamic SUTA (DSUTA), an entropy-minimization-based continual TTA method for ASR, together with a dynamic reset strategy that automatically detects domain shifts and resets the model, making DSUTA more robust on time-varying, multi-domain data.

Abstract

Deep learning-based end-to-end Automatic Speech Recognition (ASR) has made significant strides but still struggles on out-of-domain samples because of domain shifts in real-world scenarios. Test-Time Adaptation (TTA) methods address this issue by adapting models using test samples at inference time. However, current ASR TTA methods have largely focused on non-continual TTA, which limits cross-sample knowledge learning compared to continual TTA. In this work, we first propose a Fast-slow TTA framework for ASR that leverages the advantages of both continual and non-continual TTA. Following this framework, we introduce Dynamic SUTA (DSUTA), an entropy-minimization-based continual TTA method for ASR. To enhance DSUTA's robustness on time-varying data, we design a dynamic reset strategy that automatically detects domain shifts and resets the model, making it more effective at handling multi-domain data. Our method demonstrates superior performance on various noisy ASR datasets, outperforming both non-continual and continual TTA baselines while maintaining robustness to domain changes without requiring domain boundary information.
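
To make "entropy minimization" concrete, here is a minimal sketch of the kind of unsupervised objective such methods reduce at test time: the average Shannon entropy of the model's per-frame output distributions. It is a generic illustration, not the authors' implementation; the function names and the toy logits are assumptions made for this example, and the actual DSUTA objective may include further terms.

```python
import math

def softmax(logits):
    """Convert one frame's logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_entropy(logits):
    """Shannon entropy (in nats) of one frame's predicted distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(utterance_logits):
    """Average entropy over all frames of one utterance.

    A TTA method based on entropy minimization would lower this value
    by gradient steps on the model parameters (hypothetical usage;
    no model or optimizer is included here).
    """
    return sum(frame_entropy(f) for f in utterance_logits) / len(utterance_logits)

# Toy logits (assumed): 2 frames, 3 output classes each.
confident = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]  # peaked predictions
uncertain = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # uniform predictions

print(mean_entropy(confident) < mean_entropy(uncertain))
```

Peaked predictions give an entropy close to zero, while uniform predictions give the maximum value, the natural logarithm of the number of classes. Minimizing this quantity therefore pushes the model toward confident outputs on the incoming test speech without needing transcripts.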

Paper Structure

This paper contains 33 sections, 8 equations, 7 figures, 8 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the proposed Fast-slow TTA framework and dynamic reset strategy with time-varying speech domains. The Fast-slow TTA framework includes meta-parameters that update slowly to capture cross-domain knowledge, while the other parameters update quickly to fit the incoming test samples. The dynamic reset strategy automatically detects domain shifts and resets the model to the source model.
  • Figure 2: Illustration of the 3 different TTA approaches.
  • Figure 3: Sketch of DSUTA with the dynamic reset strategy. The domain construction stage and the shift detection stage alternate over time. When a large shift is detected, a model reset is applied to DSUTA, i.e., $\phi_{t+1}=\phi_{pre}$.
  • Figure 4: WER difference compared to the pre-trained model on the CM domain over time. Data is smoothed with a window of size 100.
  • Figure 5: Distributions of averaged LII (over 5 samples) from the GS domain (in) and non-GS domains (out).
  • ...and 2 more figures