Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

Jiawen Kang; Lingwei Meng; Mingyu Cui; Yuejiao Wang; Xixin Wu; Xunying Liu; Helen Meng

TL;DR

This paper investigates the role of Connectionist Temporal Classification (CTC) in multi-talker ASR (MTASR) and introduces Speaker-Aware CTC (SACTC), a training objective built on the Bayes risk CTC framework that explicitly models speaker disentanglement. Combined with Serialized Output Training (SOT), SACTC consistently outperforms the standard SOT-CTC model across degrees of speech overlap, with relative WER reductions of 10% overall and 15% on low-overlap speech. The work is an initial exploration of CTC-based enhancements for MTASR and offers a new perspective on speaker disentanglement: constraining the encoder to represent different speakers' tokens at specific time frames. These findings suggest a practical path toward more robust MTASR systems and motivate future work on streaming and non-autoregressive extensions.

Abstract

Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when combined with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a CTC variant tailored to multi-talker scenarios: it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition. The code is available at https://github.com/kjw11/Speaker-Aware-CTC.
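As a rough illustration of the serialized target that SOT trains on, the sketch below concatenates per-speaker transcripts in order of start time and separates them with a speaker-change symbol. This follows the way SOT is commonly described; the abstract does not spell out these details, so the token name `<sc>`, the function name `serialize_sot`, and the input layout are assumptions made for this example only.

```python
from typing import List, Tuple

# Speaker-change symbol; the exact token name is an assumption for this sketch.
SC_TOKEN = "<sc>"


def serialize_sot(utterances: List[Tuple[float, List[str]]]) -> List[str]:
    """Build one serialized label sequence from several speakers.

    Each utterance is (start_time_in_seconds, tokens). Utterances are
    ordered by start time and joined with SC_TOKEN between speakers.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    serialized: List[str] = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            serialized.append(SC_TOKEN)  # mark the switch to the next speaker
        serialized.extend(tokens)
    return serialized


# Two overlapping speakers; the second entry starts first.
example = serialize_sot([
    (1.2, ["good", "morning"]),
    (0.0, ["hello", "there"]),
])
print(example)
```

A CTC-style objective such as SOT-CTC or SACTC would then align a sequence of this kind to the encoder frames, which is where the question of which frames carry which speaker's tokens arises.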

Paper Structure (10 sections, 11 equations, 3 figures, 3 tables)

Introduction
Methods
Revisit CTC in speech recognition
Speaker-aware CTC based on minimizing Bayes risk
Experimental setup
Results and discussions
Analysis of vanilla CTC
Performance of SACTC
Conclusions
Acknowledgements

Figures (3)

Figure 1: A simplified illustration of the proposed speaker-aware risk function with CTC lattice. Red area indicates high risk and green for low risk. Tokens 1 and 2,3,4 are from different speakers. Two encouraged alignments are shown as examples.
Figure 2: Visualization of the top-50 attended frames for two speakers (red and blue). Purple marks frames attended by both speakers.
Figure 3: Attention matrices in the last conformer blocks of SOT (a) and SOT-CTC (b) models. In (b), the overlapped area was encoded into separate output frames.
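
The risk-weighted lattice of Figure 1 can be illustrated with a toy, brute-force computation. The sketch below assumes the general Bayes risk CTC idea of summing p(path | x) * r(path) over all valid alignments, which reduces to vanilla CTC when r is always 1. The risk function here is hypothetical and is not the paper's formulation: it simply penalizes a path each time a speaker's token is emitted outside the temporal half assumed to belong to that speaker. All names (`risk_weighted_ctc`, `make_speaker_risk`, `boundary`, `lam`) are invented for this example, and exhaustive enumeration is only feasible for tiny inputs.

```python
import itertools
import math

import numpy as np

BLANK = 0  # index of the CTC blank symbol in this toy vocabulary


def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def risk_weighted_ctc(probs, target, risk_fn):
    """Sum p(path | x) * risk_fn(path) over every alignment of `target`.

    probs has shape (T, V): per-frame posteriors over V symbols.
    Enumerates all V**T paths, so it is for illustration only.
    """
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == list(target):
            p = math.prod(probs[t, s] for t, s in enumerate(path))
            total += p * risk_fn(path)
    return total


def make_speaker_risk(token_speaker, boundary, lam=1.0):
    """Hypothetical risk: frames before `boundary` are assumed to belong to
    speaker 0 and the rest to speaker 1; each misplaced token frame
    multiplies the path weight by exp(-lam)."""
    def risk(path):
        misplaced = 0
        for t, s in enumerate(path):
            if s == BLANK:
                continue
            expected = 0 if t < boundary else 1
            if token_speaker[s] != expected:
                misplaced += 1
        return math.exp(-lam * misplaced)
    return risk


# Toy setup: 4 frames, symbols {blank, 1, 2}, uniform posteriors.
probs = np.full((4, 3), 1.0 / 3.0)
target = [1, 2]  # token 1 from speaker 0, token 2 from speaker 1

vanilla = risk_weighted_ctc(probs, target, lambda path: 1.0)
risk = make_speaker_risk({1: 0, 2: 1}, boundary=2)
weighted = risk_weighted_ctc(probs, target, risk)

print("vanilla CTC likelihood:", vanilla)
print("risk-weighted likelihood:", weighted)
```

With a risk function of this kind, alignments that keep each speaker's tokens inside that speaker's region retain their full weight, while the others are discounted, so maximizing the weighted sum favors the "green" alignments in the figure. A practical implementation would use a forward-backward recursion over the lattice rather than enumeration.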
