The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge
Jingguang Tian, Shuaishuai Ye, Shunfei Chen, Yang Xiang, Zhaohui Yin, Xinhui Hu, Xinkang Xu
TL;DR
The paper addresses in-car multi-speaker automatic speech recognition with an automatic speech diarization and recognition (ASDR) system that combines TS-VAD-based speaker diarization, front-end enhancement, and a self-supervised HuBERT-based ASR. For diarization, the system fuses four TS-VAD models with DOVER-Lap and applies guided source separation to the resulting speaker segments. For recognition, it uses an SE-HuBERT end-to-end ASR model with a joint CTC/attention decoder and a language model, targeting noisy, overlapping speech. The approach reduces DER by 49.58% compared with the official baseline on the development set, and reaches a Track 1 CER of 16.93% and a Track 2 cpCER of 25.88% on the respective evaluation sets, with gains attributed to multi-model fusion (ROVER/DOVER-Lap) and end-to-end modeling. These results suggest that an integrated diarization and ASR pipeline is practical for in-car voice interfaces in difficult acoustic conditions.
Abstract
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93% on the Track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88% on the Track 2 evaluation set.
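As an illustration of the Track 2 metric named above, the following is a minimal sketch of how a concatenated minimum-permutation CER can be computed. It is not the challenge's official scoring script. The function names (`edit_distance`, `cp_cer`) and the dictionary input layout are assumptions made for this example, and the brute-force search over permutations is only practical for a handful of speakers.

```python
from itertools import permutations
from typing import Mapping, Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cp_cer(refs: Mapping[str, Sequence[str]],
           hyps: Mapping[str, Sequence[str]]) -> float:
    """Sketch of cpCER: concatenate each speaker's utterances, then take
    the speaker permutation that minimises the total character errors."""
    ref_texts = ["".join(utts) for utts in refs.values()]
    hyp_texts = ["".join(utts) for utts in hyps.values()]

    # Pad with empty transcripts so both sides have the same speaker count.
    n = max(len(ref_texts), len(hyp_texts))
    ref_texts += [""] * (n - len(ref_texts))
    hyp_texts += [""] * (n - len(hyp_texts))

    total_ref_chars = sum(len(r) for r in ref_texts)
    if total_ref_chars == 0:
        raise ValueError("reference transcripts are empty")

    # Brute force over all assignments of hypothesis to reference speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_texts, perm))
        for perm in permutations(hyp_texts)
    )
    return best_errors / total_ref_chars


# Toy usage: hypothesis speaker labels differ from the reference labels.
refs = {"A": ["abcd", "ef"], "B": ["xyz"]}
hyps = {"s1": ["xyz"], "s2": ["abcd", "eg"]}
score = cp_cer(refs, hyps)
```

In the toy example the hypothesis labels are arbitrary, so the metric first matches `s2` to `A` and `s1` to `B`, then counts one substituted character out of nine reference characters. Unlike the Track 1 CER, this score also penalises words attributed to the wrong speaker, which is why it depends on diarization quality as well as recognition quality.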
