The USTC-NERCSLIP Systems for The ICMC-ASR Challenge
Minghui Wu, Luzhen Xu, Jie Zhang, Haitao Tang, Yanyan Yue, Ruizhi Liao, Jintao Zhao, Zhengzhe Zhang, Yichi Wang, Haoyin Yan, Hongliang Yu, Tongle Ma, Jiachen Liu, Chongliang Wu, Yongchao Li, Yanyong Zhang, Xin Fang, Yue Zhang
TL;DR
The paper tackles robust ASR in the ICMC-ASR challenge, where overlapping speech from multiple speakers and varying Mandarin accents make recognition difficult. It presents an integrated pipeline that combines front-end guided source separation with MVDR beamforming, large-scale semi-supervised data expansion via iterative pseudo-label generation (PLG) with a fusion model, a multi-speaker diarization system that fuses SSLR-based x-vectors with MC-TS-VAD, and an Accent-ASR framework that jointly models pronunciation and linguistic information. Key contributions include effective interference suppression in the front-end, scalable PLG-based data expansion, and accent-aware ASR that improves recognition under varying accents. The system ranks first on both Track 1 and Track 2, which shows the practical value of combining advanced front-end processing, semi-supervised learning, and accent-aware modeling for real-world multi-speaker Mandarin ASR.
Abstract
This report describes the system submitted to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which addresses ASR with multi-speaker overlap and varying Mandarin accents in the in-car multi-channel setting. For the front end, we perform speaker diarization using multi-speaker embeddings based on self-supervised learning representations, and beamforming using the speaker positions. For ASR, we employ an iterative pseudo-label generation method based on a fusion model to obtain text labels for unlabeled data. To mitigate the impact of accents, we propose an Accent-ASR framework that captures pronunciation-related accent features at a fine-grained level and linguistic information at a coarse-grained level. On the ICMC-ASR eval set, the proposed system achieves a CER of 13.16% on Track 1 and a cpCER of 21.48% on Track 2, significantly outperforming the official baseline system and ranking first on both tracks.
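
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a character error rate (CER) and a concatenated minimum-permutation CER (cpCER) are commonly computed. It is not the challenge's official scoring tool: the function names `edit_distance`, `cer`, and `cp_cer` are hypothetical, no text normalization is applied, and the brute-force search over speaker permutations is only practical for a small number of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(ref_by_speaker, hyp_by_speaker):
    """cpCER sketch (assumed definition, following the usual cpWER recipe).

    Each argument is a list with one entry per speaker, and each entry is
    that speaker's list of utterance strings in time order.
    """
    # Concatenate every speaker's utterances into a single string.
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]
    # Pad with empty speakers so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    # Try every assignment of hypothesis speakers to reference speakers
    # and keep the one with the fewest total errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars
```

In this sketch, `cer` scores a single transcript against its reference, which matches a setting where segmentation is given, while `cp_cer` additionally absorbs the speaker-label ambiguity that arises when the system must perform its own diarization.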
