A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar

TL;DR

This paper investigates the use of self-supervised learning (SSL) representations for single-channel speech enhancement in challenging conditions and finds that they add very little value for the enhancement task.

Abstract

Self-supervised learning (SSL) models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, and keyword spotting. While their features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established and is perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement: the model is causal and the compute footprint is small. Additionally, we focus on low-SNR conditions, where such models struggle to provide good enhancement. To systematically examine how SSL representations impact the performance of such enhancement models, we propose a variety of techniques for utilizing these embeddings, including different forms of knowledge distillation and pre-training.

Paper Structure (20 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: GCRN model and different modes of using SSL embeddings to guide enhancement model. The right panel (red) shows our pre-training modes. The left panel (green) shows knowledge distillation modes and uses of SSL embeddings as inputs.
  • Figure 2: Different ways of knowledge distillation from Wav2Vec2 embeddings used in this paper: (a) sample-wise distillation via encoder output, (b) distillation via enhanced signal, (c) adversarial distillation, and (d) triplet-loss-based distillation.
  • Figure 3: Wav2Vec2: Box plot of (a) correlations and (b) Euclidean distances obtained from frames of Wav2Vec2 embeddings separated by 20 ms, 60 ms, 400 ms, 1 s, and 2 s.
  • Figure 4: Distilled model: Box plot of (a) correlations and (b) Euclidean distances obtained from frames of Wav2Vec2 embeddings separated by 20 ms, 60 ms, 400 ms, 1 s, and 2 s.
  • Figure 5: Illustration of spectrograms: (a) ground truth speech, (b) speech decoded using Wav2Vec2 embeddings, and (c) speech decoded using embeddings extracted from the trained knowledge distillation model (same encoder as Wav2Vec2).