Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR

Yuchen Hu, Chen Chen, Qiushi Zhu, Eng Siong Chng

TL;DR

Wav2code addresses noise-robust ASR by introducing a self-supervised framework that stores clean speech priors in a discrete codebook and uses a Transformer-based code predictor to restore clean representations from noisy inputs. The framework then fuses restored and original noisy features through IFF-Net to balance fidelity and quality for downstream ASR, achieving state-of-the-art robustness on LibriSpeech-FS and CHiME-4. The key ideas are pre-training a clean-prior codebook from SSL representations, predicting codes with global dependencies, and interactive fusion to mitigate distortion while preserving information critical for accurate recognition.

Abstract

Automatic speech recognition (ASR) has achieved remarkable success thanks to recent advances in deep learning, but its performance usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as a front-end to improve speech quality, which has proved effective but may not be optimal for downstream ASR because of the speech distortion problem. Building on this, the latest works combine SE with the currently popular self-supervised learning (SSL) to alleviate distortion and improve noise robustness. Despite their effectiveness, the speech distortion caused by conventional SE still cannot be fully eliminated. In this paper, we propose a self-supervised framework named Wav2code to implement feature-level SE with reduced distortion for noise-robust ASR. First, in the pre-training stage, the clean speech representations from an SSL model are used to look up a discrete codebook via nearest-neighbor feature matching; the resulting code sequence is then used to reconstruct the original clean representations, so that they are stored in the codebook as a prior. Second, during finetuning, we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of the input noisy representations, which enables discovery and restoration of high-quality clean representations with reduced distortion. Furthermore, we propose an interactive feature fusion network to combine the original noisy and the restored clean representations to account for both fidelity and quality, resulting in more informative features for downstream ASR. Finally, experiments on both synthetic and real noisy datasets demonstrate that Wav2code can mitigate speech distortion and improve ASR performance under various noisy conditions, resulting in stronger robustness.
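
The codebook lookup in the pre-training stage can be illustrated with a minimal sketch. It assumes squared Euclidean distance as the matching criterion and uses a small fixed toy codebook; the abstract only states "nearest-neighbor feature matching", so the distance metric, the function names (`lookup_codes`, `quantize`), and all numeric values are illustrative assumptions, not the authors' implementation. In the actual framework the codebook entries are learned rather than fixed.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed metric, not confirmed by the source)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lookup_codes(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Return, for each frame, the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in features
    ]


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map frames to a code sequence and to the matched codebook entries."""
    codes = lookup_codes(features, codebook)
    return codes, [list(codebook[k]) for k in codes]


# Toy data: a codebook with 3 entries of dimension 2 and 4 "clean" frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
clean_features = [[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.4, 0.4]]

codes, quantized = quantize(clean_features, codebook)
print(codes)      # code sequence, one index per frame
print(quantized)  # codebook entries selected for each frame
```

The code sequence is the discrete representation from which the clean features are reconstructed; at finetuning time the same indices become the targets that a code predictor would have to recover from noisy input.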

Paper Structure

This paper contains 18 sections, 15 equations, 7 figures, 8 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Illustration of Enhanced Wav2vec2.0. It is pre-trained with a contrastive loss between masked noisy speech representations $Z_n$ and quantized clean targets $q_c$, and then finetuned with a linear output layer and CTC loss (Graves et al., 2006).
  • Figure 2: Illustration of the proposed Wav2code framework. (a) Pre-training stage: codebook learning via nearest-neighbor matching to store clean speech prior. (b) Finetuning stage: code prediction to restore clean speech representations from prior for downstream ASR. Solid arrows indicate direct data flow, and dashed arrows indicate mapping relationship.
  • Figure 3: Illustration of the proposed interactive feature fusion network (IFF-Net). The Encoder and Decoder compress and recover the number of feature channels, like a bottleneck. A ResNet block is employed to capture local context, and a Separable Self-Attention (SSA) module is used to model global dependencies. The Interaction module lets the two branches of features interact, and the Merge module finally fuses them to generate the output.
  • Figure 4: Comparison of code prediction accuracy under different SNR levels. The results have been averaged over all FreeSound noise types.
  • Figure 5: t-SNE visualizations of clean/noisy speech features and codebook entries. Features of the same color are extracted from parallel clean/noisy speech data. The examples show that noise corruption increases the diversity and uncertainty of speech features, making correct code assignment challenging.
  • ...and 2 more figures
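
The code prediction accuracy reported per SNR level in Figure 4 can be sketched as a frame-level metric. This is a hypothetical illustration: it assumes accuracy is the fraction of frames whose predicted code equals a reference code taken from the parallel clean utterance, and that frames are pooled per SNR level. The source does not spell out the exact definition, and the helper name `accuracy_by_snr` and the toy values are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# One record per utterance: (SNR in dB, predicted codes, reference clean codes).
Utterance = Tuple[int, Sequence[int], Sequence[int]]


def accuracy_by_snr(utterances: Iterable[Utterance]) -> Dict[int, float]:
    """Frame-level code accuracy, pooled over all utterances at each SNR."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for snr_db, predicted, reference in utterances:
        if len(predicted) != len(reference):
            raise ValueError("predicted and reference code sequences differ in length")
        correct[snr_db] += sum(p == r for p, r in zip(predicted, reference))
        total[snr_db] += len(reference)
    # Skip SNR levels that contributed no frames to avoid division by zero.
    return {snr: correct[snr] / total[snr] for snr in sorted(total) if total[snr]}


# Toy data (assumed values): noisier input yields more wrong code assignments.
toy = [
    (0, [3, 3, 7, 1], [3, 5, 7, 2]),   # 2 of 4 frames correct
    (0, [4, 4], [4, 4]),               # 2 of 2 frames correct
    (10, [3, 5, 7, 2], [3, 5, 7, 1]),  # 3 of 4 frames correct
]

result = accuracy_by_snr(toy)
for snr_db, acc in result.items():
    print(f"SNR {snr_db:>2} dB: {acc:.3f}")
```

Pooling frames rather than averaging per-utterance accuracies weights long utterances more heavily; either choice is plausible, and the figure caption only states that results are averaged over noise types.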