EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann
TL;DR
The paper addresses the need for a high-quality, diverse, fullband speech dataset for speech enhancement and dereverberation research. It introduces the EARS dataset, which contains about 100 hours of anechoic 48 kHz speech from 107 speakers across a wide range of speaking styles and emotions, together with the EARS-WHAM and EARS-Reverb benchmarks for noisy and reverberant conditions. The authors benchmark predictive and generative methods, including Conv-TasNet, CDiffuSE, Demucs, and SGMSE+, and evaluate them with instrumental metrics, a listening test, and a blind test set scored by an online evaluation server. Together, the dataset, benchmarks, and evaluation server provide a reproducible basis for comparing speech enhancement and dereverberation systems.
Abstract
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. The dataset download links and the automatic evaluation server can be found online.
