DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, Hank Liao
TL;DR
DiarizationLM is a post-processing approach that uses a finetuned large language model to correct speaker diarization outputs, operating on a compact text representation of the ASR and diarization results. The framework introduces a Transcript-Preserving Speaker Transfer (TPST) algorithm and three finetuning flavors (hyp2ora, deg2ref, mixed) that train the LLM to reduce the Word Diarization Error Rate (WDER) without altering the underlying ASR transcripts. With a finetuned PaLM 2-S model, experiments show relative WDER reductions of about 55% on Fisher and about 45% on Callhome, while zero-shot and one-shot baselines without task-specific training perform poorly. Because it is a post-processing step, the pipeline can be applied to existing diarization systems without retraining them, and it opens pathways to multilingual and broader-task extensions using LLMs.
Abstract
In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLMs) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented in a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by a relative 55.5% on the Fisher telephone conversation dataset, and by a relative 44.9% on the Callhome English dataset.
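
The abstract does not spell out the compact textual format or how WDER is computed, so the following Python sketch is only an illustration under stated assumptions. The `<spk:N>` tag syntax and all function names are hypothetical. The error rate shown is a simplified word-level speaker error that assumes the hypothesis and reference share the same words and that speaker labels are already mapped to each other; the full WDER also has to account for ASR substitutions and speaker permutations.

```python
from typing import List, Optional, Tuple


def to_compact_text(words: List[str], speakers: List[int]) -> str:
    """Serialize word-level speaker labels, emitting a tag only at speaker changes.

    The "<spk:N>" tag syntax is an assumption made for this sketch.
    """
    if len(words) != len(speakers):
        raise ValueError("words and speakers must have the same length")
    parts: List[str] = []
    previous: Optional[int] = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            parts.append(f"<spk:{speaker}>")
            previous = speaker
        parts.append(word)
    return " ".join(parts)


def from_compact_text(text: str) -> Tuple[List[str], List[int]]:
    """Parse the compact text back into parallel word and speaker lists."""
    words: List[str] = []
    speakers: List[int] = []
    current: Optional[int] = None
    for token in text.split():
        if token.startswith("<spk:") and token.endswith(">"):
            current = int(token[len("<spk:"):-1])
        else:
            if current is None:
                raise ValueError("word appears before any speaker tag")
            words.append(token)
            speakers.append(current)
    return words, speakers


def simple_word_speaker_error(hyp: List[int], ref: List[int]) -> float:
    """Fraction of words with a wrong speaker label (simplified stand-in for WDER).

    Assumes both sequences cover the same words in the same order.
    """
    if len(hyp) != len(ref) or not ref:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(h != r for h, r in zip(hyp, ref)) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement, e.g. 0.555 corresponds to 'rel. 55.5%'."""
    if before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (before - after) / before


example_words = "hello how are you good thanks".split()
example_speakers = [1, 1, 1, 1, 2, 2]
compact = to_compact_text(example_words, example_speakers)
# compact == "<spk:1> hello how are you <spk:2> good thanks"
```

Parsing the LLM output with `from_compact_text` and comparing its words against the original ASR words is one way to check that a refinement changed only the speaker labels, which is the property the post-processing step is meant to preserve.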
