Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

William Ravenscroft; George Close; Stefan Goetze; Thomas Hain; Mohammad Soleymanpour; Anurag Chowdhury; Mark C. Fuhs

TL;DR

This paper proposes a transcription-free method for joint training of speech separation and ASR using only audio signals, built on a modification of permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) over a signal-level loss and also improves perceptual measures such as short-time objective intelligibility (STOI).

Abstract

One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).
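As a rough illustration of the ingredients named in the abstract, the following sketch applies standard permutation invariant training (PIT) to an embedding-difference loss. It is not the paper's GPIT, whose details are not given in this section. The `toy_encoder` function is a hypothetical stand-in for a pre-trained ASR encoder, and the mean-absolute-difference distance and all function names are assumptions made for illustration.

```python
import itertools

import numpy as np


def toy_encoder(x, frame=4):
    """Hypothetical stand-in for a pre-trained ASR encoder (assumption):
    maps a waveform to frame-wise mean energy."""
    n = len(x) // frame * frame
    return (x[:n].reshape(-1, frame) ** 2).mean(axis=1)


def embedding_loss(est, ref, encoder=toy_encoder):
    """Mean absolute difference between encoder outputs of estimate and reference."""
    return float(np.mean(np.abs(encoder(est) - encoder(ref))))


def pit_loss(estimates, references, pair_loss=embedding_loss):
    """Standard PIT: the lowest average pairwise loss over all
    assignments of estimated signals to reference speakers."""
    num_speakers = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(num_speakers)):
        # perm[c] is the index of the estimate assigned to reference c.
        loss = sum(
            pair_loss(estimates[p], references[c]) for c, p in enumerate(perm)
        ) / num_speakers
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Usage: two toy "speakers" whose estimates come out in swapped order.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(64)
s2 = 3.0 * rng.standard_normal(64)
loss, perm = pit_loss([s2, s1], [s1, s2])
```

In this toy case the search recovers the swapped assignment, so the loss does not penalise the separator for emitting the speakers in a different order than the references.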

Paper Structure (17 sections, 6 equations, 3 figures, 3 tables)

Introduction
Transcription-Free Fine-Tuning Method
ASR Encoder Loss
Guided Permutation Invariant Training (GPIT)
Experimental Setup
Data
Speech Separator
Speech Recognizers
Fine-Tuning
Evaluation Metrics
Results
Results on clean targets
Generalization to an Unseen ASR System
Joint SISDR Loss Weighting
Training Signal Length Analysis
...and 2 more sections

Figures (3)

Figure 1: (a) Baseline approach to training speech separators without ASR-based fine-tuning. (b) Proposed fine-tuning approach without using reference transcriptions. Solid lines indicate information flow; dashed lines indicate the direction of gradient backpropagation. The figure shows the case of $C=2$ speakers.
Figure 2: ASR performance (CP-WER) and objective perceptual quality of test audio for models trained with differing weight $\alpha$ between loss terms in (\ref{eq:joint_loss}).
Figure 3: Wav2Vec2 ASR encoder output representations for reference audio $\mathcal{V}(s[n])$ (top), and $\mathcal{V}(\hat{s}[n])$ for models with baseline $\mathcal{L}_\mathrm{SISDR}$ (middle) and proposed $\mathcal{L}_\mathrm{AE}$ fine-tuning (bottom) losses.
