A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition

Dimme de Groot, Yuanyuan Zhang, Jorge Martinez, Odette Scharenborg

TL;DR

Five of the eight ASR models achieve word error rates (WERs) below 22% on DRES despite the challenging conditions, and modern single-channel SE shows no positive effect on ASR performance.

Abstract

We present DRES: a 1.5-hour Dutch realistic elicited (semi-spontaneous) speech dataset from 80 speakers recorded in noisy, public indoor environments. DRES was designed as a test set for the evaluation of state-of-the-art (SOTA) automatic speech recognition (ASR) and speech enhancement (SE) models in a real-world scenario: a person speaking in a public indoor space with background talkers and noise. The speech was recorded with a four-channel linear microphone array. In this work we evaluate the speech quality of five well-known single-channel SE algorithms and the recognition performance of eight SOTA off-the-shelf ASR models before and after applying SE on the speech of DRES. We found that five out of the eight ASR models have WERs lower than 22% on DRES, despite the challenging conditions. In contrast to recent work, we did not find a positive effect of modern single-channel SE on ASR performance, emphasizing the importance of evaluating in realistic conditions.

Paper Structure (17 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: (a) An example picture card. (b) An example prompt card. (c) The four-channel microphone array used during the recording.
  • Figure 2: The distribution of the mean opinion score (MOS) for the recordings at the four recording locations, estimated using DNSMOS. Higher values indicate better quality. 'Overall' represents the combined results over all locations.
  • Figure 3: Left: Estimated distribution of DNSMOS for the baseline signal and each of the SE algorithms. Higher values indicate better overall quality. Right: DNSMOS improvements relative to the baseline for each of the SE algorithms.