ASR-Synchronized Speaker-Role Diarization
Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna
TL;DR
The paper addresses the practical need for role-based diarization in multi-speaker conversations by adapting ASR-synchronized diarization to predict speaker roles (e.g., doctor vs. patient) in parallel with ASR. It shows that role prediction relies more on linguistic context than on acoustic cues, and therefore introduces task-specific predictors and higher-layer ASR features, along with a training strategy that applies cross-entropy loss along a 1-best forced-alignment path. Empirical results on DoPaCo and SiMeCo show consistent improvements in role-based word diarization error rate (R-WDER) over baselines, with the final P3 model delivering the best performance and faster training. The work enables more informative transcripts for downstream NLP tasks and highlights a practical approach to integrating lexical and linguistic cues in a jointly trained ASR+RD framework.
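To make the metric concrete, here is a minimal Python sketch of a role-based word diarization error rate. It assumes R-WDER follows the usual WDER convention: among words that the ASR alignment marks as correct or substituted, count the fraction whose predicted role differs from the reference role. The function names and the toy inputs are illustrative assumptions, not taken from the paper.

```python
def role_wder(aligned_words):
    """Fraction of aligned words whose predicted role is wrong.

    aligned_words: iterable of (reference_role, predicted_role) pairs,
    one per word that the ASR alignment marks as correct or substituted
    (insertions and deletions are assumed to be excluded upstream).
    """
    pairs = list(aligned_words)
    if not pairs:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for ref, hyp in pairs if ref != hyp)
    return wrong / len(pairs)


def relative_reduction(baseline, system):
    """Relative error reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


# Toy example with made-up labels: one of four words gets the wrong role.
toy = [("doctor", "doctor"), ("doctor", "patient"),
       ("patient", "patient"), ("patient", "patient")]
toy_score = role_wder(toy)
```

A relative reduction is computed against the baseline score, so `relative_reduction(0.10, 0.09)` corresponds to a 10% relative improvement on these hypothetical numbers.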
Abstract
Speaker-role diarization (RD), such as doctor vs. patient or lawyer vs. client, is often more useful in practice than conventional speaker diarization (SD), which assigns only generic labels (speaker-1, speaker-2). The state-of-the-art end-to-end ASR+RD approach uses a single transducer that serializes word and role predictions (with the role emitted at the end of a speaker's turn), but at the cost of degraded ASR performance. To address this, we adapt a recent joint ASR+SD framework to ASR+RD by freezing the ASR transducer and training an auxiliary RD transducer in parallel to assign a role to each ASR-predicted word. We first show that SD and RD are fundamentally different tasks, exhibiting different dependencies on acoustic and linguistic information. Motivated by this, we propose (1) task-specific predictor networks and (2) using higher-layer ASR encoder features as input to the RD encoder. Additionally, we replace the blank-shared RNNT loss with a cross-entropy loss along the 1-best forced-alignment path, which further improves performance while reducing computational and memory requirements during RD training. Experiments on a public and a private dataset of doctor-patient conversations demonstrate that our method outperforms the best baseline with relative reductions of 6.2% and 4.5% in role-based word diarization error rate (R-WDER), respectively.
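The loss change can be illustrated with a small Python sketch. A transducer loss sums over all alignments in the frame-by-token lattice, whereas a cross-entropy loss along a single forced-alignment path only needs the role scores at the lattice nodes where the frozen ASR emits each word. The sketch below assumes role logits indexed as `role_logits[t][u]` (frame `t`, token position `u`) and a precomputed path of `(t, u)` emission nodes; these names, shapes, and the pure-Python formulation are illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def path_cross_entropy(role_logits, path, role_targets):
    """Mean cross-entropy of role predictions along one alignment path.

    role_logits:  nested list, role_logits[t][u] is a list of R role scores
                  (hypothetical layout of the RD joint-network output).
    path:         list of (t, u) nodes, one per ASR word, where the frozen
                  ASR emits that word on its 1-best forced alignment.
    role_targets: list of reference role indices, one per ASR word.
    """
    if len(path) != len(role_targets):
        raise ValueError("path and role_targets must have the same length")
    if not path:
        raise ValueError("need at least one word")
    total = 0.0
    for (t, u), role in zip(path, role_targets):
        # Only the nodes on the path are evaluated, not the full lattice.
        total -= log_softmax(role_logits[t][u])[role]
    return total / len(path)


# Toy lattice: 3 frames, 2 token positions, 2 roles, all scores equal.
uniform_logits = [[[0.0, 0.0] for _ in range(2)] for _ in range(3)]
uniform_loss = path_cross_entropy(uniform_logits, [(0, 0), (2, 1)], [0, 1])
```

Because only one node per word is scored, the cost grows with the number of words rather than with the size of the full lattice, which is consistent with the reduced compute and memory that the abstract reports for RD training.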
