
Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition

Chien-Chun Wang, Li-Wei Chen, Cheng-Kang Chou, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang

TL;DR

This work tackles the drop in ASR performance caused by channel mismatch by introducing CADA-GAN, a channel-aware domain-adaptive generative framework. A dedicated channel encoder extracts target-domain channel embeddings from a small amount of unpaired data. These embeddings condition a GAN-based speech synthesizer that transforms source-domain speech into target-channel speech while preserving phonetic content, aided by patch-wise contrastive learning and a channel reconstruction loss. The method yields relative CER reductions of 20.02% on the Hakka Across Taiwan (HAT) corpus and 9.64% on the Taiwanese Across Taiwan (TAT) corpus compared to the baselines, along with improved MOS scores, demonstrating robust cross-channel generalization with minimal target data. The approach provides a practical path to channel-robust ASR that can be integrated with strong downstream models and extended to broader datasets and larger ASR architectures in the future.
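To make the "patch-wise contrastive learning" term more concrete, the following is a minimal sketch of an InfoNCE-style patch loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity, and the temperature of 0.07 are all hypothetical choices, and the summary does not specify how patches are sampled or encoded. The idea shown is that each patch of the generated speech should match the source patch at the same position more closely than source patches at other positions, which encourages the synthesizer to keep phonetic content in place.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_nce_loss(source_patches, generated_patches, temperature=0.07):
    """Average InfoNCE loss over patches (hypothetical helper).

    For generated patch i, the positive is source patch i and the
    negatives are all other source patches. The temperature value is
    an assumption, not a setting taken from the paper.
    """
    total = 0.0
    for i, query in enumerate(generated_patches):
        logits = [cosine(query, key) / temperature for key in source_patches]
        # Numerically stable log-sum-exp over all candidate patches.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the positive at index i.
        total += log_sum - logits[i]
    return total / len(generated_patches)


if __name__ == "__main__":
    source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = patch_nce_loss(source, source)
    shuffled = patch_nce_loss(source, source[1:] + source[:1])
    print(aligned < shuffled)
```

In this toy setup the loss is close to zero when generated patches line up with their source positions and large when the positions are shuffled, which is the behavior a content-preserving objective of this kind is meant to reward.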

Abstract

While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. On top of this, channel embeddings are extracted using a minimal amount of target-domain data and used to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics.
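The reported figures are relative CER reductions, which are easy to confuse with absolute differences. As a rough sketch, CER can be computed as the character-level edit distance divided by the reference length, and the relative reduction as the drop in CER divided by the baseline CER. The function names below are hypothetical, and the numbers in the usage section are illustrative placeholders, not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction in percent, measured against the baseline."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Illustrative values only; these are NOT numbers from the paper.
    print(cer("abcd", "abxd"))
    print(relative_reduction(10.0, 8.0))
```

Under this definition, a baseline CER of 10.0% that drops to 8.0% is a 2-point absolute improvement but a 20% relative reduction, which is the kind of quantity the abstract reports.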

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: The architecture of our proposed method, CADA-GAN. The dotted arrows indicate that during the training phase, simulated speech $\mathbf{X}^G$ is used together with target speech $\mathbf{X}^T$ to 1) train the discriminator, and 2) contribute to channel reconstruction. The $\bigoplus$ operator denotes element-wise tensor addition.
  • Figure 2: The UMAP visualization of channel embeddings extracted from eight channel types in the HAT corpus and six channel types in the TAT corpus.
  • Figure 3: Validation loss of our channel encoder on the HAT corpus, alongside the average pairwise Euclidean distance between channel embeddings, with respect to the number of training epochs.
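Two operations named in the captions above can be sketched in a few lines: the $\bigoplus$ element-wise addition from Figure 1 and the average pairwise Euclidean distance from Figure 3. This is a minimal sketch under assumptions, not the authors' code. In particular, the function names are hypothetical, the example assumes the channel embedding has the same dimensionality as each feature frame and is added to every frame, and it is an assumption that the distance in Figure 3 is averaged over all unordered pairs of embeddings.

```python
import math
from itertools import combinations


def add_channel_embedding(features, channel_embedding):
    """Add a channel embedding to every frame, element by element.

    `features` is a list of frames, each a list of floats with the same
    length as `channel_embedding` (assumed shapes, for illustration).
    """
    return [[f + c for f, c in zip(frame, channel_embedding)]
            for frame in features]


def average_pairwise_distance(embeddings):
    """Mean Euclidean distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0]]
    print(add_channel_embedding(frames, [10.0, 20.0]))
    print(average_pairwise_distance([[0.0, 0.0], [3.0, 4.0]]))
```

A larger average pairwise distance suggests that embeddings of different channels are spread further apart, which is one way to read the trend plotted alongside the validation loss.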