A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition

Dimme de Groot; Yuanyuan Zhang; Jorge Martinez; Odette Scharenborg

TL;DR

Five of the eight ASR models achieve WERs below 22% on DRES despite the challenging conditions, and modern single-channel SE shows no positive effect on ASR performance.

Abstract

We present DRES: a 1.5-hour Dutch realistic elicited (semi-spontaneous) speech dataset from 80 speakers, recorded in noisy, public indoor environments. DRES was designed as a test set for evaluating state-of-the-art (SOTA) automatic speech recognition (ASR) and speech enhancement (SE) models in a real-world scenario: a person speaking in a public indoor space with background talkers and noise. The speech was recorded with a four-channel linear microphone array. In this work, we evaluate the speech quality obtained with five well-known single-channel SE algorithms, and the recognition performance of eight SOTA off-the-shelf ASR models before and after applying SE to the speech of DRES. Five of the eight ASR models achieve word error rates (WERs) below 22% on DRES, despite the challenging conditions. In contrast to recent work, we did not find a positive effect of modern single-channel SE on ASR performance, which emphasizes the importance of evaluating in realistic conditions.
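For readers unfamiliar with the metric, the sketch below shows the standard definition of WER: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. It is a minimal illustration only. The function name `word_error_rate` and the example sentences are hypothetical, and the text normalization and scoring tools actually used for DRES are not described in this section.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is dropped.
wer = word_error_rate("de kat zit op de mat", "de kat zit op mat")
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Comparing WER before and after enhancement, as done in the paper, amounts to scoring the same reference against two different hypotheses.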

Paper Structure (17 sections, 3 figures, 2 tables)

Introduction
The DRES corpus
Corpus design
The speech recordings
Participants
Transcriptions
The quality of the speech data
Experiments
Speech enhancement algorithms
Automatic speech recognition models
Evaluation metrics
Results
Speech quality
ASR performance
Discussion and conclusions
...and 2 more sections

Figures (3)

Figure 1: An example picture-card (a) and prompt-card (b). (c): The four-channel microphone array used during the recording.
Figure 2: The distribution of the mean opinion score (MOS) for the recordings at the four recording locations, estimated using DNSMOS. Higher values indicate better quality. 'Overall' represents the combined results over all locations.
Figure 3: Left: Estimated distribution of DNSMOS scores for the baseline signal and each of the SE algorithms. Higher values indicate better overall quality. Right: DNSMOS score improvements with respect to the baseline for each of the SE algorithms.
