SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models

Anurag Kumar, Rohit Paturi, Amber Afshan, Sundararajan Srinivasan

TL;DR

The paper addresses speaker attribution errors in multi-speaker transcripts by passing acoustic information from a first-pass speaker diarization (SD) module to a fine-tuned LLM that acts as a second-pass speaker error corrector. The proposed system, SEAL, conditions the LLM on acoustics by mapping frame-level SD posteriors to text-friendly labels, and applies constrained decoding (CD) so that the output stays faithful to the input transcript while the speaker labels are adjusted. Across Fisher, Callhome, and RT03-CTS, SEAL achieves 24–43% relative reductions in speaker errors compared with the first-pass acoustic SD, with the LLMAC-label variant and CD in the Spkword format delivering notable gains. The approach is reported to improve robustness to overlapping speech and conversational variability, and could be extended to additional languages and domains as stronger LLMs become available.
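The acoustic conditioning step described above can be illustrated with a minimal sketch. It assumes word-level frame boundaries are available, averages the frame-level posterior of the hypothesized speaker over each word, and buckets the result into the three labels (low, med, high) shown in Figure 2. The thresholds, the function names, and the `spkN:word/label` output format are all hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def score_to_label(score: float, low_max: float = 0.6, med_max: float = 0.9) -> str:
    """Bucket a confidence in [0, 1] into 'low', 'med', or 'high'.

    The two thresholds are illustrative assumptions, not values from the paper.
    """
    if score < low_max:
        return "low"
    if score < med_max:
        return "med"
    return "high"


def word_confidence(
    frame_posteriors: Sequence[Sequence[float]], start: int, end: int, speaker: int
) -> float:
    """Average the posterior of `speaker` over frames in [start, end)."""
    frames = frame_posteriors[start:end]
    if not frames:
        raise ValueError("word spans no frames")
    return sum(frame[speaker] for frame in frames) / len(frames)


def annotate(
    words: List[Tuple[str, int, int, int]],
    frame_posteriors: Sequence[Sequence[float]],
) -> str:
    """Render (text, start_frame, end_frame, speaker_index) tuples as labeled text.

    The 'spkN:word/label' layout is a hypothetical format for this sketch.
    """
    pieces = []
    for text, start, end, spk in words:
        label = score_to_label(word_confidence(frame_posteriors, start, end, spk))
        pieces.append(f"spk{spk + 1}:{text}/{label}")
    return " ".join(pieces)


# Toy two-speaker posteriors, one row per frame (made-up numbers).
posteriors = [[0.95, 0.05], [0.97, 0.03], [0.55, 0.45],
              [0.50, 0.50], [0.20, 0.80], [0.10, 0.90]]
words = [("hello", 0, 2, 0), ("yeah", 2, 4, 0), ("right", 4, 6, 1)]
print(annotate(words, posteriors))
```

Turning posteriors into a few coarse labels keeps the LLM input purely textual, which is the property the TL;DR refers to as "text-friendly".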

Abstract

Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models, including fine-tuned large language models (LLMs), have been shown to be effective as second-pass speaker error correctors by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations while avoiding complicated post-processing. Our approach reduces speaker error rates by 24–43% across the Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass acoustic SD.
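One way to picture how constrained decoding avoids hallucinated words is the sketch below. It assumes a simplified setting in which the decoder copies every input word unchanged and is only free to choose the speaker label attached to it. The `score_fn` callback stands in for the LLM's scores, and `toy_score` is a made-up heuristic; neither is the paper's actual decoding procedure.

```python
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[str, str]  # (speaker label, word)


def constrained_decode(
    words: Sequence[str],
    speakers: Sequence[str],
    score_fn: Callable[[List[Labeled], str, str], float],
) -> List[Labeled]:
    """Greedy decoding where only the speaker label is a free choice.

    Each input word is copied verbatim, so the output word sequence always
    equals the input word sequence by construction.
    """
    output: List[Labeled] = []
    for word in words:
        best = max(speakers, key=lambda spk: score_fn(output, spk, word))
        output.append((best, word))
    return output


def toy_score(prefix: List[Labeled], speaker: str, word: str) -> float:
    """Hypothetical stand-in for LLM scores: favor speaker continuity,
    but favor a speaker change on backchannel-like words."""
    prev = prefix[-1][0] if prefix else "spk1"
    score = 1.0 if speaker == prev else 0.0
    if word in {"yeah", "okay"} and speaker != prev:
        score += 2.0
    return score


transcript = ["how", "are", "you", "yeah", "good"]
decoded = constrained_decode(transcript, ["spk1", "spk2"], toy_score)
print(decoded)
```

Because words are never generated freely in this sketch, no post-processing is needed to realign the corrected output with the original transcript.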

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 2 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Proposed framework of SEAL.
  • Figure 2: Different input transcript formats with acoustic score mapped to 3 labels: low, med, high.
  • Figure 3: A qualitative example of the incremental speaker error corrections with each of the proposed strategies. Errors are shown in red and the incremental corrections in green.