Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge
Shangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng
TL;DR
The paper tackles audio-visual diarization and recognition (AVDR) in the MISP 2025 Challenge with two components: a hybrid diarization pipeline that combines WavLM-based end-to-end segmentation with VBx clustering, and an ASR-aware observation addition (OA) framework that compensates for the distortions introduced by Guided Source Separation (GSS) in low-SNR settings. The ASR component uses a three-path signal fusion and a sentence-level bridging module, supervised by ASR, to optimize the OA coefficients. It achieves a character error rate (CER) of $5.91\%$ (Dev) and $10.09\%$ (Eval), which ROVER further improves to $5.54\%$ and $9.48\%$. Cascading the diarization and ASR systems yields a cpCER of $11.56\%$ on Track 3. The system placed first in both Track 2 and Track 3, showing strong performance in noisy, overlapped, real-world multi-speaker meetings. The study also analyzes the impact of overlap handling, dereverberation, and front-end speech enhancement, and notes that video information brings only limited gains under adverse visual conditions.
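As a rough illustration of the observation addition idea mentioned above, the sketch below mixes several time-aligned signals with scalar coefficients. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `observation_addition`, the normalization of the coefficients to sum to one, and the choice of which three signals form the paths are all hypothetical, and the ASR-supervised bridging module that predicts the coefficients is not modeled here.

```python
from typing import List, Sequence


def observation_addition(
    paths: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return a weighted sum of time-aligned signals (hypothetical helper).

    `paths` holds one waveform per signal path; `weights` holds one
    non-negative coefficient per path. Normalizing the weights so they
    sum to 1 is an assumption of this sketch.
    """
    if not paths or len(paths) != len(weights):
        raise ValueError("need one weight per signal path")
    n_samples = len(paths[0])
    if any(len(p) != n_samples for p in paths):
        raise ValueError("all paths must have the same length")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")

    norm = [w / total for w in weights]
    # Sample-by-sample convex combination of the paths.
    return [
        sum(w * p[t] for w, p in zip(norm, paths))
        for t in range(n_samples)
    ]


# Toy example with three paths (placeholder values, not real audio).
separated = [0.0, 0.5, -0.5]    # e.g. a separation output
observed = [0.2, 0.4, -0.2]     # e.g. the unprocessed observation
third_path = [0.1, 0.3, -0.3]   # hypothetical third signal
mixed = observation_addition([separated, observed, third_path], [0.5, 0.25, 0.25])
```

Adding a fraction of the unprocessed observation back to the separated signal trades some residual interference for less processing distortion, which is the motivation the paper gives for using OA in low-SNR conditions.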
Abstract
This paper presents the system developed for the MISP 2025 Challenge. For the diarization system, we propose a hybrid approach that combines a WavLM end-to-end segmentation method with a traditional multi-module clustering technique, adaptively selecting the appropriate model according to the degree of overlapping speech. For the automatic speech recognition (ASR) system, we propose an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. Finally, we integrate the speaker diarization and ASR systems in a cascaded architecture to address Track 3. Our system achieved a character error rate (CER) of 9.48% on Track 2 and a concatenated minimum-permutation character error rate (cpCER) of 11.56% on Track 3, securing first place in both tracks and demonstrating the effectiveness of the proposed methods in real-world meeting scenarios.
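For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and cpCER can be computed. It assumes each speaker's utterances have already been concatenated into one string, pads a missing speaker with an empty transcript, and searches speaker permutations by brute force. The function names are illustrative, and the challenge's official scoring tool may differ in text normalization and other details.

```python
from itertools import permutations
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Concatenated minimum-permutation CER over per-speaker transcripts."""
    n = max(len(refs), len(hyps))
    # Assumption: pad the shorter side with empty transcripts.
    refs = list(refs) + [""] * (n - len(refs))
    hyps = list(hyps) + [""] * (n - len(hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("references must contain at least one character")
    # Brute-force search for the speaker mapping with the fewest errors.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref


# Toy example: ASCII letters stand in for characters; speaker labels are swapped.
score = cp_cer(["abcd", "wxyz"], ["wxyz", "abed"])
```

Because cpCER minimizes over speaker permutations, swapped speaker labels are not penalized, but characters attributed to the wrong speaker are. The brute-force search is only practical for a handful of speakers; a linear assignment solver would scale better.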
