THCHS-30: A Free Chinese Speech Corpus
Dong Wang, Xuewei Zhang
TL;DR
The paper addresses the cost barrier to obtaining large-scale data for Chinese ASR by releasing THCHS-30, a freely available Chinese speech corpus with an accompanying lexicon, language models, scripts, and noise data. It describes the dataset's characteristics and resources and a Kaldi-based baseline ASR system, reports performance under clean and highly noisy conditions, and shows that a front-end based on a denoising autoencoder (DAE) can mitigate the effects of noise. Beyond the free dataset itself, the paper calls for community challenges to benchmark progress on large-vocabulary and phoneme recognition under $0$ dB noise. This work lowers the entry barrier for new researchers and provides a standard reference for Chinese ASR development and evaluation, potentially accelerating progress in both academia and industry.
Abstract
Speech data is crucially important for speech recognition research. Quite a few speech databases can be purchased at prices that are reasonable for most research institutes. However, for young people who are just starting research or who have only recently become interested in this direction, the cost of data is still an annoying barrier. We support the 'free data' movement in speech recognition: research institutes (particularly those supported by public funds) publish their data freely so that new researchers can obtain sufficient data to kick off their careers. In this paper, we follow this trend and release a free Chinese speech database, THCHS-30, that can be used to build a full-fledged Chinese speech recognition system. We report the baseline system established with this database, including its performance under highly noisy conditions.
