LLM-based speaker diarization correction: A generalizable approach

Georgios Efstathiadis, Vijay Yadav, Anzar Abbas

TL;DR

The paper tackles speaker diarization errors in ASR transcripts by fine-tuning large language models (LLMs) to correct speaker labels as a post-processing step. ASR-specific fine-tuned models substantially improve diarization on transcripts from the same ASR tool, but they generalize poorly to transcripts from unseen ASR tools. To mitigate this, an ASR-agnostic ensemble merges the weights of models tuned on AWS, Azure, and WhisperX transcripts and is more robust across tools. The approach uses the Fisher corpus for training and PriMock57 as an independent dataset, evaluates with the deltaCP and deltaSA metrics, and introduces a completion parser and TPST-based oracle alignment so that the transcribed text itself is preserved. The authors release the ensemble weights on HuggingFace and provide a user-friendly interface, which makes the approach practical for clinical and conversational transcription workflows where diarization accuracy is critical.

Abstract

Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.
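
To make the idea of "combining weights from three separate models" concrete, here is a minimal sketch that merges three parameter sets by weighted averaging. This is only an illustration under stated assumptions: the abstract does not say which merging method the authors used, so uniform averaging stands in for it, and the function name `average_weights`, the parameter name `layer.weight`, and the toy values are all hypothetical. Parameters are plain Python lists so that the example runs without any deep learning library.

```python
def average_weights(state_dicts, coefficients=None):
    """Merge parameter dictionaries by a weighted average (illustrative only).

    state_dicts: list of {parameter_name: list of floats}, one per model.
    coefficients: optional per-model weights; defaults to a uniform average.
    """
    n = len(state_dicts)
    if coefficients is None:
        coefficients = [1.0 / n] * n
    if len(coefficients) != n:
        raise ValueError("need one coefficient per model")

    names = state_dicts[0].keys()
    if any(sd.keys() != names for sd in state_dicts):
        raise ValueError("all models must share the same parameter names")

    merged = {}
    for name in names:
        vectors = [sd[name] for sd in state_dicts]
        # Weighted sum of the same parameter entry across all models.
        merged[name] = [
            sum(c * v[i] for c, v in zip(coefficients, vectors))
            for i in range(len(vectors[0]))
        ]
    return merged


# Toy stand-ins for three ASR-specific fine-tuned models (hypothetical values).
aws_model = {"layer.weight": [1.0, 2.0]}
azure_model = {"layer.weight": [3.0, 4.0]}
whisperx_model = {"layer.weight": [5.0, 6.0]}

ensemble = average_weights([aws_model, azure_model, whisperx_model])
```

In this sketch each merged entry is the mean of the three models' values for that entry. Non-uniform `coefficients` would let one ASR-specific model contribute more than the others.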

Paper Structure

This paper contains 27 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Accurate speaker diarization is necessary for interpretation of important conversations.
  • Figure 2: Overview of pipeline used to evaluate the LLM.
  • Figure 3: Creation of oracle transcripts using the TPST algorithm. Words and speaker labels are extracted from each transcript. The algorithm aligns word sequences, such that the resulting speaker labels from the reference transcript match the text of the ASR transcript. This corrects speaker labeling in the ASR transcript without changing the underlying transcription.
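
The label transfer described in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the TPST algorithm itself: it aligns words with Python's `difflib.SequenceMatcher` rather than whatever alignment the paper uses, it assumes the reference and ASR transcripts already share the same speaker label names, and the function name `transfer_speakers` and the toy transcripts are hypothetical.

```python
from difflib import SequenceMatcher


def transfer_speakers(ref_words, ref_speakers, asr_words, asr_speakers):
    """Return one speaker label per ASR word, taken from the reference
    wherever the two word sequences align. The ASR words are never changed."""
    oracle = list(asr_speakers)  # fallback: keep the ASR label
    matcher = SequenceMatcher(a=ref_words, b=asr_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical words: copy the reference label one-to-one.
            for k in range(j2 - j1):
                oracle[j1 + k] = ref_speakers[i1 + k]
        elif tag == "replace":
            # Substituted words: map by position, clamped to the reference span.
            for k in range(j2 - j1):
                oracle[j1 + k] = ref_speakers[min(i1 + k, i2 - 1)]
        # "insert": ASR-only words keep their ASR label.
        # "delete": reference-only words have no ASR word to label.
    return oracle


# Reference transcript with correct speaker labels.
ref_words = "hello how are you i am fine".split()
ref_speakers = ["A", "A", "A", "A", "B", "B", "B"]

# ASR transcript: "i am" was recognized as "i'm", and "you" is mislabeled.
asr_words = "hello how are you i'm fine".split()
asr_speakers = ["A", "A", "A", "B", "B", "B"]

oracle_speakers = transfer_speakers(ref_words, ref_speakers, asr_words, asr_speakers)
```

The result pairs the unchanged ASR words with labels borrowed from the reference, so the mislabeled "you" is reassigned to speaker A while the ASR's "i'm" stays as transcribed.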