ASR-Synchronized Speaker-Role Diarization

Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna

TL;DR

The paper addresses the practical need for role-based diarization in multi-speaker conversations by adapting ASR-synchronized diarization to predict speaker roles (e.g., doctor vs. patient) in parallel with ASR. It shows that role prediction relies more on linguistic context than on acoustic cues, and therefore introduces task-specific predictors and higher-layer ASR encoder features, together with a training strategy that applies cross-entropy loss along the 1-best forced-alignment path. Results on DoPaCo and SiMeCo show consistent improvements in role-based word diarization error rate (R-WDER) over the baselines, with the final P3 model delivering the best performance and faster training. The work enables more informative transcripts for downstream NLP tasks and offers a practical way to exploit lexical and linguistic cues in a joint ASR+RD framework in which the ASR transducer stays frozen and only the auxiliary RD transducer is trained.

Abstract

Speaker-role diarization (RD), such as doctor vs. patient or lawyer vs. client, is often more useful in practice than conventional speaker diarization (SD), which assigns only generic labels (speaker-1, speaker-2). The state-of-the-art end-to-end ASR+RD approach uses a single transducer that serializes word and role predictions (role at the end of a speaker's turn), but at the cost of degraded ASR performance. To address this, we adapt a recent joint ASR+SD framework to ASR+RD by freezing the ASR transducer and training an auxiliary RD transducer in parallel to assign a role to each ASR-predicted word. For this, we first show that SD and RD are fundamentally different tasks, exhibiting different dependencies on acoustic and linguistic information. Motivated by this, we propose (1) task-specific predictor networks and (2) using higher-layer ASR encoder features as input to the RD encoder. Additionally, we replace the blank-shared RNNT loss with a cross-entropy loss along the 1-best forced-alignment path to further improve performance while reducing computational and memory requirements during RD training. Experiments on a public and a private dataset of doctor-patient conversations demonstrate that our method outperforms the best baseline with relative reductions of 6.2% and 4.5% in role-based word diarization error rate (R-WDER), respectively.
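The loss replacement described in the abstract can be illustrated with a small sketch. An RNNT loss sums over every alignment path in the $(t,u)$ lattice, whereas a cross-entropy loss along one fixed path only needs the RD outputs at the steps that path visits. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of `rd_logits`, the role indices (0 = DOC, 1 = PAT, 2 = OTH), the choice to score only word-emitting steps, and the mean normalization are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Step = Tuple[int, int]  # a (t, u) lattice position; layout is an assumption


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of role logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def path_cross_entropy(
    rd_logits: Dict[Step, Sequence[float]],
    emit_steps: Sequence[Step],
    role_targets: Sequence[int],
) -> float:
    """Mean cross-entropy over the word-emitting steps of one alignment path.

    Assumption: blank steps of the path carry no role target and are skipped,
    so only one lattice position per word contributes to the loss.
    """
    if not emit_steps or len(emit_steps) != len(role_targets):
        raise ValueError("need one role target per emitting step")
    total = 0.0
    for step, role in zip(emit_steps, role_targets):
        total -= log_softmax(rd_logits[step])[role]
    return total / len(emit_steps)


# Toy example: three words emitted at hypothetical lattice positions.
rd_logits = {
    (0, 0): [2.0, 0.0, 0.0],  # favors DOC
    (1, 1): [2.0, 0.0, 0.0],  # favors DOC
    (3, 2): [0.0, 2.0, 0.0],  # favors PAT
}
emit_steps = [(0, 0), (1, 1), (3, 2)]
role_targets = [0, 0, 1]  # DOC, DOC, PAT
loss = path_cross_entropy(rd_logits, emit_steps, role_targets)
print(loss)
```

Because only the visited positions are scored, the number of RD outputs needed per utterance grows with the number of words rather than with the full lattice size, which is consistent with the reduced compute and memory the abstract reports.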

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of prior work in (a) and (b), along with our proposed RD training in (c), for an example audio $\mathbf{x}$ with corresponding ground-truth label sequences: $\mathbf{y}^{\text{Role-ASR}} = [\text{hello}, \text{there}, \text{DOC}, \text{hi}, \text{PAT}]$, $\mathbf{y}^{\text{ASR}} = [\text{hello},\text{there},\text{hi}]$, $\mathbf{y}^{\text{SD}} = [\text{s}^1,\text{s}^1,\text{s}^2]$, and $\mathbf{y}^{\text{RD}} = [\text{DOC},\text{DOC},\text{PAT}]$. Blue-colored ASR modules remain frozen during the training of the auxiliary SD and RD transducers. The graphs at the top of each network show the alignment paths used to compute the training loss. In (c), the path with bold arrows represents the 1-best forced-alignment path.
  • Figure 2: Effect of Role-ASR predictor's context length on (a) WER and (b) R-WDER when trained on DoPaCo and evaluated on the validation set of DoPaCo.
  • Figure 3: Activity of ASR and RD networks at each $(t,u)$-step of the best beam search path. The reference and hypothesis are shown at the top. The top three plots show the RD network's posteriors for DOC, PAT, and OTH. The bottom-most plot shows the ASR posteriors for the top token in blue (blank token omitted to avoid clutter) and the second-best token (for two specific regions) in orange.
  • Figure 4: DoPaCo validation set: reduction in WER when the top-$n$ ($n \in \{1,2,\ldots,10\}$) deletion-words are accounted for during RD-guided ASR decoding. The top-$10$ deletion-words for the validation set are: yeah, okay, I, and, it, you, a, the, oh, right.
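
The relationship between the label sequences in the Figure 1 caption can be made concrete with a short sketch. It converts a serialized Role-ASR sequence, where a role token closes each speaker turn, into the per-word ASR and RD sequences. The function name `split_role_asr` and the error handling are hypothetical; the role inventory (DOC, PAT, OTH) is taken from the captions above.

```python
from typing import List, Sequence, Set, Tuple

ROLE_TOKENS = {"DOC", "PAT", "OTH"}  # role labels named in the figure captions


def split_role_asr(
    tokens: Sequence[str], roles: Set[str] = ROLE_TOKENS
) -> Tuple[List[str], List[str]]:
    """Split a serialized Role-ASR sequence into words and per-word roles."""
    words: List[str] = []
    word_roles: List[str] = []
    turn_start = 0  # index in `words` where the current turn began
    for tok in tokens:
        if tok in roles:
            # A role token closes the turn: label every word spoken in it.
            word_roles.extend([tok] * (len(words) - turn_start))
            turn_start = len(words)
        else:
            words.append(tok)
    if turn_start != len(words):
        raise ValueError("trailing words without a closing role token")
    return words, word_roles


y_asr, y_rd = split_role_asr(["hello", "there", "DOC", "hi", "PAT"])
print(y_asr)  # ['hello', 'there', 'hi']
print(y_rd)   # ['DOC', 'DOC', 'PAT']
```

The output matches $\mathbf{y}^{\text{ASR}}$ and $\mathbf{y}^{\text{RD}}$ from the caption: every word receives its own role label, which is the per-word target that a parallel RD transducer predicts.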