DiarizationLM: Speaker Diarization Post-Processing with Large Language Models

Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, Hank Liao

TL;DR

DiarizationLM is a post-processing approach that uses a finetuned large language model to refine and correct speaker diarization outputs by operating on a compact text representation of ASR and diarization results. The framework introduces Transcript-Preserving Speaker Transfer (TPST) and three finetuning flavors (hyp2ora, deg2ref, mixed) that train the LLM to reduce the Word Diarization Error Rate (WDER) without altering the underlying ASR transcripts. Experiments on Fisher and Callhome show relative WDER reductions of about 55% on Fisher and about 45% on Callhome with a finetuned PaLM 2-S model, while zero-shot and one-shot baselines perform poorly without task-specific training. The work demonstrates a flexible post-processing pipeline that applies to any off-the-shelf ASR and diarization system, and it opens pathways for multilingual and broader-task extensions using LLMs.

Abstract

In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLMs) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented in a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by a relative 55.5% on the Fisher telephone conversation dataset, and by a relative 44.9% on the Callhome English dataset.
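
To make the "compact textual format" in the abstract concrete, here is a minimal sketch of one possible serialization. The `<spk:N>` token syntax and the function names `to_compact_text` and `from_compact_text` are assumptions made for illustration; this listing does not specify the exact format. The idea shown is only that word-level speaker labels can be flattened into a single string for an LLM prompt and parsed back afterwards.

```python
from typing import List, Tuple


def to_compact_text(words: List[str], speakers: List[int]) -> str:
    """Flatten word-level speaker labels into one string.

    A speaker token (hypothetical syntax "<spk:N>") is emitted only when the
    speaker changes, which keeps the text short.
    """
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    pieces: List[str] = []
    previous = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            pieces.append(f"<spk:{speaker}>")
            previous = speaker
        pieces.append(word)
    return " ".join(pieces)


def from_compact_text(text: str) -> Tuple[List[str], List[int]]:
    """Parse the string back into parallel lists of words and speaker labels."""
    words: List[str] = []
    speakers: List[int] = []
    current = None
    for token in text.split():
        if token.startswith("<spk:") and token.endswith(">"):
            current = int(token[len("<spk:"):-1])
        else:
            if current is None:
                raise ValueError("word found before any speaker token")
            words.append(token)
            speakers.append(current)
    return words, speakers


example_words = ["good", "morning", "how", "are", "you"]
example_speakers = [1, 1, 2, 2, 2]
compact = to_compact_text(example_words, example_speakers)
```

Because the speaker tokens are kept separate from the words, a parser like `from_compact_text` can recover the speaker sequence from the LLM output while the word sequence stays available for comparison against the original ASR transcript.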

Paper Structure

This paper contains 32 sections, 3 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: The orchestration module associates each word from the ASR transcript with a speaker label from the speaker diarization outputs. (a) In this example, all words are associated with the correct speaker labels (green arrows). The words "good", "morning", and "are" are each associated with the only speaker label that overlaps with them. The word "how" overlaps with both spk1 and spk2, but has a larger overlap with spk2, and is thus associated with spk2. The word "you" does not overlap with any speaker, but is closest to spk2, and is thus associated with spk2. (b) In this example, two words are associated with the wrong speaker labels (red arrows) due to inconsistent timing information from the two systems. The word "how" is mistakenly associated with spk1, since spk1 has more overlap with this word than spk2. The word "you" is mistakenly associated with spk1, since spk1 is closer to this word than spk2.
  • Figure 2: Diagram of the proposed DiarizationLM framework.
  • Figure 3: The histogram for the number of hypothesis speakers predicted by the turn-to-diarize system on the Fisher training set. Note that the ground truth number of speakers is always two on the Fisher dataset, but we do not constrain the number of speakers for the turn-to-diarize system.
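
The association rule described in the Figure 1 caption (pick the speaker with the largest time overlap, and fall back to the closest speaker when nothing overlaps) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the names `assign_speakers`, `overlap`, and `gap`, the tuple layouts, and all timestamps in the example are hypothetical.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]     # (text, start_sec, end_sec)
Segment = Tuple[str, float, float]  # (speaker, start_sec, end_sec)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def gap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Distance between two time intervals (0 if they touch or overlap)."""
    return max(0.0, max(a_start, b_start) - min(a_end, b_end))


def assign_speakers(words: List[Word], segments: List[Segment]) -> List[str]:
    """Give each word the speaker with the most overlap, else the nearest one."""
    if not segments:
        raise ValueError("at least one speaker segment is required")
    labels: List[str] = []
    for _, w_start, w_end in words:
        # Sum the overlap per speaker, since a speaker may have several segments.
        totals: Dict[str, float] = {}
        for speaker, s_start, s_end in segments:
            totals[speaker] = totals.get(speaker, 0.0) + overlap(
                w_start, w_end, s_start, s_end
            )
        best = max(totals, key=totals.get)
        if totals[best] > 0.0:
            labels.append(best)
        else:
            # No overlap at all: fall back to the closest segment in time.
            nearest = min(segments, key=lambda s: gap(w_start, w_end, s[1], s[2]))
            labels.append(nearest[0])
    return labels


# Made-up timings loosely following Figure 1(a).
segments = [("spk1", 0.0, 1.2), ("spk2", 1.1, 2.0)]
words = [
    ("good", 0.0, 0.4),
    ("morning", 0.4, 1.0),
    ("how", 1.05, 1.4),  # overlaps both speakers, more with spk2
    ("are", 1.4, 1.7),
    ("you", 2.1, 2.4),   # overlaps nobody, closest to spk2
]
labels = assign_speakers(words, segments)
```

Because the rule depends only on timestamps, small timing disagreements between the ASR and diarization systems can flip a label, as in case (b) of Figure 1. These are the errors the LLM post-processing step is meant to correct.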