
The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge

Jingguang Tian, Shuaishuai Ye, Shunfei Chen, Yang Xiang, Zhaohui Yin, Xinhui Hu, Xinkang Xu

TL;DR

The paper addresses in-car multi-speaker speech recognition by presenting an automatic speech diarization and recognition (ASDR) system that combines TS-VAD-based speaker diarization with front-end enhancement and a self-supervised, HuBERT-based ASR model. The system uses four TS-VAD models and guided source separation, fused via DOVER-Lap, for speaker segmentation, while an SE-HuBERT end-to-end ASR model with a joint CTC/attention decoder and a language model (LM) handles transcription under noisy, overlapping conditions. Quantitatively, the approach reduces the diarization error rate (DER) by 49.58% compared with the official baseline on the development set, and reaches a Track 1 CER of 16.93% and a Track 2 cpCER of 25.88% on the respective evaluation sets, with gains attributed to multi-model fusion (ROVER/DOVER) and end-to-end optimization. These results suggest that integrated diarization and ASR pipelines are practical for in-car voice interfaces in challenging acoustic environments.

Abstract

This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88% on the track 2 evaluation set.
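
The two recognition metrics named above can be illustrated with a minimal sketch. It assumes the common definitions: CER is the character-level edit distance divided by the reference length, and cpCER concatenates each speaker's utterances, tries every assignment of hypothesis speakers to reference speakers, and keeps the assignment with the fewest errors. The function names (`edit_distance`, `cer`, `cp_cer`) and the toy transcripts are hypothetical and are not taken from the challenge's scoring tools, which may normalize text differently.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against one non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(ref_by_speaker, hyp_by_speaker) -> float:
    """Concatenated minimum-permutation CER over per-speaker utterance lists."""
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]
    # Pad with empty transcripts so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    # Brute force over speaker assignments; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


# Toy example: the hypothesis lists the speakers in the opposite order.
reference = [["abc", "de"], ["xyz"]]
hypothesis = [["xyz"], ["abc", "dd"]]
print(cp_cer(reference, hypothesis))
```

Because the permutation search is factorial in the number of speakers, a practical scorer would typically replace it with an assignment solver; the brute-force form is used here only to keep the definition visible.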

Paper Structure (13 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: (a) The overview of the ASDR system; (b) Data flow for speaker diarization training; (c) Data flow for ASR training.