Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement

Chien-Chun Wang; Hung-Shin Lee; Hsin-Min Wang; Berlin Chen

TL;DR

URSA-GAN presents a unified approach to robust speech adaptation by jointly modeling noise and channel mismatches through dual domain encoders and a GAN-based generator. The framework leverages FiLM conditioning, patch-wise contrastive learning, and dynamic stochastic perturbation to simulate target-domain speech from limited unlabeled data, improving downstream ASR and SE. Extensive experiments across multi-domain datasets show consistent improvements over baselines, with strong cross-domain generalization and informative ablation insights. This work offers a scalable data-simulation paradigm for cross-domain speech processing that can enhance robustness in realistic, variable acoustic environments.

Abstract

Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) perform remarkably well under matched noise and channel conditions. However, these models often degrade severely under domain shift, particularly in the presence of unseen noise and channel distortions. To address this, we present URSA-GAN, a unified, domain-aware generative framework designed to mitigate mismatches in both noise and channel conditions. URSA-GAN uses a dual-embedding architecture consisting of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, enabling the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To further improve generalization, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
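
The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the perturbation is zero-mean Gaussian noise added to each embedding, that the noise and channel embeddings are fused by element-wise addition, and that the FiLM scale and shift are fixed functions of the fused embedding (a real model would learn that mapping). All function names here are hypothetical.

```python
import random

def perturb(embedding, std, rng):
    """Add zero-mean Gaussian noise with standard deviation `std` to each dimension.
    Assumption: the perturbation is Gaussian; the abstract only says 'controlled variability'."""
    return [e + rng.gauss(0.0, std) for e in embedding]

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: list of frames, each a list of C channel values.
    gamma, beta: per-channel scale and shift, each of length C.
    """
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)] for frame in features]

def condition(features, noise_emb, channel_emb, std, rng):
    """Perturb both embeddings, fuse them, and modulate the features with FiLM."""
    noisy_n = perturb(noise_emb, std, rng)
    noisy_c = perturb(channel_emb, std, rng)
    fused = [n + c for n, c in zip(noisy_n, noisy_c)]  # assumed fusion: element-wise sum
    gamma = [1.0 + f for f in fused]  # hypothetical mapping: scale centred at 1
    beta = list(fused)                # hypothetical mapping: shift equals fused embedding
    return film(features, gamma, beta)

# One frame with two channels; std=0.0 disables the perturbation.
rng = random.Random(0)
clean = condition([[1.0, 2.0]], [0.5, 0.0], [0.5, 1.0], 0.0, rng)
jittered = condition([[1.0, 2.0]], [0.5, 0.0], [0.5, 1.0], 0.1, rng)
```

With `std=0.0` the conditioning is deterministic; a positive `std` makes each call generate from a slightly different point in embedding space, which is the kind of variability the abstract credits with improving robustness to unseen domains.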

Paper Structure (35 sections, 9 equations, 6 figures, 11 tables)

Introduction
Preliminaries
Automatic Speech Recognition
Speech Enhancement
Domain Adaptation: UNA-GAN
Proposed Methodology
Framework Overview
Generator and Discriminator
Domain Encoders
Feature Fusion
Patch-wise Contrastive Learning
Training Objective
Adaptation Process
Experimental Setups
Datasets
...and 20 more sections

Figures (6)

Figure 1: Character error rates (CERs) of ASR models on the HAT corpus. Each group of bars represents evaluation on audio from a specific recording device, with individual bars showing CERs for models trained on different devices. Performance generally degrades under device mismatch, underscoring the impact of channel variation and the challenge of cross-device generalization.
Figure 2: The architecture of our proposed framework, URSA-GAN. Solid lines represent the forward data flow during both training and inference phases. The dashed arrows indicate that during the training phase, simulated speech $\mathbf{X}^G$ is used together with target speech $\mathbf{X}^T$ to 1) train the discriminator $D$, and 2) contribute to noise reconstruction and channel consistency. The $\bigoplus$ operator denotes element-wise tensor addition.
Figure 3: Illustration of the patch-wise contrastive learning process. Solid arrows denote the flow of feature extraction and projection, while dashed arrows indicate shared weights between the projection heads. The process involves feature extraction using the generator $G$, patch sampling, projection into the embedding space through the projection head $F$, and cross-entropy loss calculation using the softmax activation function.
Figure 4: PESQ and STOI results of dynamic stochastic perturbation under various standard deviations on the VBD dataset.
Figure 5: The UMAP visualization of embeddings extracted from different noise and channel types in the HAT-ESC dataset.
...and 1 more figure
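
The patch-wise contrastive process described in the Figure 3 caption (patch sampling, projection, then a softmax cross-entropy loss) can be sketched as an InfoNCE-style loss. This is a hedged illustration, not the paper's code: it assumes cosine similarity between L2-normalized projections, treats patches at other locations as negatives, and uses a temperature of 0.07, which is a common default rather than a value taken from the paper. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def patch_nce_loss(queries, keys, temperature=0.07):
    """Mean softmax cross-entropy over patches.

    queries[i] and keys[i] are projected features for the same patch location
    (the positive pair); keys[j] with j != i serve as negatives.
    """
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # cross-entropy with the positive at index i
    return total / len(q)

patches = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = patch_nce_loss(patches, patches)
loss_swapped = patch_nce_loss(patches, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each query is most similar to the key at its own location and large when the correspondence is broken, which is how such a term can encourage the generated speech to keep the content of the corresponding input patch.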
