SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech

Adam Sabra, Cyprian Wronka, Michelle Mao, Samer Hijazi

TL;DR

The paper tackles scalable onboarding of clean speech for supervised speech enhancement (SE) and text-to-speech (TTS) systems, where manual listening and annotation are costly. It introduces SECP, an iterative pipeline that combines a non-causal U-Net-based SE model with a non-causal voice activity detector (VAD) to identify high-SNR segments and format the data for downstream tasks, retraining the SE model across two rounds. The work reports $\Delta_{PESQ}$ improvements across four internal datasets and 21 noise types, along with CMOS-based subjective gains, and shows that using enhanced output as ground truth does not degrade performance. Overall, SECP offers a practical framework for expanding clean-speech corpora and improving SE models with minimal manual intervention, with potential applicability to both SE and TTS research.

Abstract

As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard said speech at scale is needed. However, this approach needs to minimize the dependency on human listening and annotation, only requiring a human-in-the-loop when needed. In this paper, we address this issue by outlining the Speech Enhancement-based Curation Pipeline (SECP), which serves as a framework to onboard clean speech. This clean speech can then train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that enhanced output used as ground truth does not degrade model performance according to $\Delta_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that the highest and lowest bounds of refined data are perceptually better than the original data.
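
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`estimated_snr_db`, `curate`, `run_rounds`), the residual-based SNR estimate, and the acceptance threshold are all assumptions made for illustration, and the `enhance` and `retrain` callables are hypothetical stand-ins for the SE model and its training step.

```python
import math
from typing import Callable, List, Sequence

Signal = List[float]
Enhancer = Callable[[Signal], Signal]


def estimated_snr_db(noisy: Sequence[float], enhanced: Sequence[float],
                     eps: float = 1e-12) -> float:
    """Assumed SNR estimate: the enhanced signal is treated as the speech
    estimate and the residual (noisy - enhanced) as the noise estimate."""
    speech_power = sum(s * s for s in enhanced)
    noise_power = sum((x - s) ** 2 for x, s in zip(noisy, enhanced))
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))


def curate(segments: List[Signal], enhance: Enhancer,
           threshold_db: float) -> List[Signal]:
    """Keep only segments whose estimated SNR meets the (assumed) threshold."""
    accepted = []
    for segment in segments:
        enhanced = enhance(segment)
        if estimated_snr_db(segment, enhanced) >= threshold_db:
            accepted.append(segment)
    return accepted


def run_rounds(segments: List[Signal], enhance: Enhancer,
               retrain: Callable[[List[Signal]], Enhancer],
               threshold_db: float, rounds: int = 2) -> List[Signal]:
    """Alternate curation and retraining; the retrained enhancer is used
    to curate the original segments again in the next round."""
    accepted: List[Signal] = []
    for _ in range(rounds):
        accepted = curate(segments, enhance, threshold_db)
        enhance = retrain(accepted)  # hypothetical training step
    return accepted
```

In this sketch, `retrain` receives the accepted segments and returns a new enhancer, so each round's curation depends on the model trained in the previous round. In practice, a VAD and human review would sit alongside the SNR check.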

Paper Structure (9 sections, 3 equations, 3 figures)

Figures (3)

  • Figure 1: A high level overview of the proposed iterative process in which both the data and speech enhancement model improve each other.
  • Figure 2: The number of accepted curated hours in each round (a) and a comparison of $\Delta_{PESQ}$ scores between model training rounds across various noise types (b).
  • Figure 3: $\hat{\rho}$ distribution of approved one-second segments between rounds (a), as well as subjective test results comparing the lowest bound of performance (b) and the highest bound of performance (c) to the original unprocessed files.