Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng

TL;DR

This work proposes an end-to-end model that separates clean speech in the form of codebook tokens while extracting interpretable noise vectors as quantization residue, which are supervised via a lightweight classifier. The learned token space generalizes well to both seen and unseen acoustic conditions.

Abstract

Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing work that quantizes Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue, which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, produces speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and a 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
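To make the idea of a "quantization residue" concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain nearest-neighbour vector quantization with Euclidean distance, and takes the residue to be the embedding minus the selected codebook vector. The function name `quantize`, the toy codebook, and the 2-dimensional vectors are all hypothetical; the abstract does not specify the quantizer, and in the actual model the residue is additionally supervised by a noise classifier, which this sketch omits.

```python
def quantize(embedding, codebook):
    """Return (token_index, residue) for one embedding vector.

    Assumption: nearest-neighbour lookup under squared Euclidean distance.
    The residue is whatever the chosen codebook entry fails to explain.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Index of the closest codebook vector: this plays the role of the speech token.
    index = min(range(len(codebook)), key=lambda i: sq_dist(embedding, codebook[i]))
    # Residue: the part of the embedding left over after quantization.
    residue = [x - c for x, c in zip(embedding, codebook[index])]
    return index, residue


# Toy example with a hypothetical 3-entry codebook of 2-dimensional vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
token, residue = quantize([0.9, 1.2], codebook)
print(token, residue)
```

In this toy case the embedding maps to the second codebook entry, and the residue is the small offset from that entry. In the proposed model, it is this leftover vector that is encouraged to carry background-noise information.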

Paper Structure

This paper contains 33 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Proposed model architecture. The dotted rectangles highlight two main components: (1) an extended speech encoder (Whisper encoder and latent encoder) and a latent decoder that improve alignment between clean speech tokens and text, and (2) a noise disentanglement module that guides the quantization residue to model background noise.
  • Figure 2: Latent encoder (orange). Whisper produces 50 embeddings/sec, while our encoder downsamples to 25 tokens/sec.
  • Figure 3: Mean L2 distance between clean and noisy embeddings. (Top) Clean audio waveform from the CHiME-4 test set. The red dotted line marks an amplitude of 0. (Bottom) The color at each position represents the L2 distance between our encoder's embeddings of the clean signal and of the clean signal mixed with various background noises. The y-axis indicates the mixed-in background noise, and the x-axis represents time steps. Lower values indicate more noise-invariant representations of sub-word tokens and token sequences, similar to those captured from clean speech.
  • Figure 4: (Top) t-SNE projections of penultimate-layer features from the noise classifier on VBDemand validation samples. Some samples were relabeled as "unknown" to encourage generalization. These noise types are seen during training. (Bottom) Similar projections from clean and noisy CHiME-4 test samples. Clean speech yields uniform embeddings; noisy speech shows class-specific clustering. CHiME-4 noise types include café (green), bus (light blue), pedestrian (yellow), and street (dark blue), which are unseen during training.
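
The per-time-step distance visualized in Figure 3 can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: it assumes the clean and noisy embedding sequences are already time-aligned and of equal length, and the helper names `framewise_l2` and `mean_l2` and the toy 2-dimensional frames are hypothetical.

```python
import math


def framewise_l2(clean, noisy):
    """L2 distance between corresponding frames of two embedding sequences."""
    if len(clean) != len(noisy):
        raise ValueError("sequences must have the same number of frames")
    return [
        math.sqrt(sum((c - n) ** 2 for c, n in zip(clean_frame, noisy_frame)))
        for clean_frame, noisy_frame in zip(clean, noisy)
    ]


def mean_l2(clean, noisy):
    """Average of the frame-wise L2 distances over the whole sequence."""
    distances = framewise_l2(clean, noisy)
    return sum(distances) / len(distances)


# Toy example: two frames of hypothetical 2-dimensional embeddings.
clean = [[0.0, 0.0], [1.0, 1.0]]
noisy = [[3.0, 4.0], [1.0, 1.0]]
distances = framewise_l2(clean, noisy)
print(distances, mean_l2(clean, noisy))
```

A frame whose distance is 0 has an embedding unaffected by the added noise. Each row of a heatmap like the one in Figure 3 would correspond to one such list of frame-wise distances, computed for one noise type.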