AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

Yuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen

TL;DR

The paper addresses robust in-car ASR under realistic acoustic conditions by introducing AISHELL-5, a large open-source, multi-channel, multi-speaker Mandarin dataset recorded inside a vehicle across more than 60 driving scenarios and complemented by a 40-hour noise corpus. It presents a reproducible baseline that combines a front-end speech separation and dereverberation module (including AEC, IVA, and Spatialnet NBSS) with a standard ASR backend, and evaluates several ASR models on the Eval1/Eval2 tracks inspired by the ICMC-ASR challenge. The results show significant gains from front-end processing (e.g., Spatialnet) and reveal remaining gaps for state-of-the-art models in multi-speaker, far-field in-car settings, with Paraformer performing well after fine-tuning. Together, the dataset and baseline offer a practical benchmark for advancing diarization, separation, and ASR in real driving contexts, in support of research on human-vehicle interaction.

Abstract

This paper presents AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 consists of two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios, comprising four far-field speech signals captured by microphones located on each car door as well as near-field signals obtained from high-fidelity headset microphones worn by each speaker; and (2) 40 hours of real-world environmental noise recordings, which support in-car speech data simulation. We also provide an open-access, reproducible baseline system based on this dataset. The system features a speech frontend model that employs speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech recognition module that accurately transcribes the content of each individual speaker. Experimental results demonstrate the challenges faced by various mainstream ASR models when evaluated on AISHELL-5. We firmly believe the AISHELL-5 dataset will significantly advance research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark.

Paper Structure

This paper contains 7 sections, 2 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Structure of our baseline system, including the training and inference processes