Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models

Jakob Poncelet, Yujun Wang, Hugo Van hamme

TL;DR

The paper addresses the sensitivity of discrete units derived from self-supervised speech models to noise and reverberation. It introduces a parameter-efficient denoiser, either an external encoder-decoder (Denoiser) or a variant with adapters (AdaDenoiser), that maps noisy SSL features to clean, deduplicated discrete units without finetuning the backbone, enabling robust discretisation and downstream ASR. On LibriSpeech-based benchmarks and on unseen distortions, the approach improves Unit Error Rate and Word Error Rate with relatively few trainable parameters, and it can be adapted to a target environment from a few unlabeled recordings. This offers a practical path to robust speech discretisation in real-world environments where labeled target-domain data is scarce.

Abstract

Continuous speech can be converted into a discrete sequence by deriving discrete units from the hidden features of self-supervised learned (SSL) speech models. Although SSL models are becoming larger and trained on more data, they are often sensitive to real-life distortions like additive noise or reverberation, which translates to a shift in discrete units. We propose a parameter-efficient approach to generate noise-robust discrete units from pre-trained SSL models by training a small encoder-decoder model, with or without adapters, to simultaneously denoise and discretise the hidden features of the SSL model. The model learns to generate a clean discrete sequence for a noisy utterance, conditioned on the SSL features. The proposed denoiser outperforms several pre-training methods on the tasks of noisy discretisation and noisy speech recognition, and can be finetuned to the target environment with a few recordings of unlabeled target data.
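
To make the terms above concrete, the following is a minimal sketch of how discrete units might be derived from feature frames, deduplicated, and scored. It rests on assumptions not stated in this summary: units are taken to be the index of the nearest centroid from a pre-computed codebook (as in k-means clustering), and Unit Error Rate is taken to be the edit distance between unit sequences divided by the reference length. The function names and the toy 2-D features are hypothetical, and the sketch does not implement the proposed denoiser itself.

```python
from typing import List, Sequence


def assign_units(features: Sequence[Sequence[float]],
                 centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    Assumption: nearest-centroid assignment with squared Euclidean distance.
    """
    units = []
    for frame in features:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
                 for centroid in centroids]
        units.append(dists.index(min(dists)))
    return units


def deduplicate(units: Sequence[int]) -> List[int]:
    """Collapse runs of identical consecutive units into a single unit."""
    out: List[int] = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def unit_error_rate(reference: Sequence[int], hypothesis: Sequence[int]) -> float:
    """Edit distance between unit sequences, normalised by reference length.

    Assumption: UER is computed like WER, but over discrete units.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Toy codebook and features (hypothetical values, for illustration only).
centroids = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
clean = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.0], [1.1, 0.9], [2.0, 0.1]]
# The third frame is perturbed, standing in for a noise-induced feature shift.
noisy = [[0.1, 0.0], [0.0, 0.1], [1.6, 0.4], [1.1, 0.9], [2.0, 0.1]]

clean_units = deduplicate(assign_units(clean, centroids))
noisy_units = deduplicate(assign_units(noisy, centroids))
print(clean_units, noisy_units, unit_error_rate(clean_units, noisy_units))
```

In this toy case a single perturbed frame changes the deduplicated sequence from `[0, 1, 2]` to `[0, 2, 1, 2]`, giving a UER of 1/3 against the clean sequence. This is the kind of unit shift under distortion that the abstract describes.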

Paper Structure (18 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Model outline for a) offline target cluster extraction on clean data, b) Denoiser training on augmented data, c) AdaDenoiser training on augmented data, and d) discrete ASR modeling. In every schematic, the lighter blocks are frozen, and the darker blocks are trained.
  • Figure 2: UERs (%) after finetuning a pre-trained HuBERT base denoiser model on 30 s noise samples from a target environment, and evaluating on unseen noise samples recorded in the same environment. The environments are Shopping Mall and Construction Site (evaluated at 10 dB), and Car and Train (evaluated at 5 dB).