Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke

TL;DR

GenSEC presents a three-task challenge to evaluate large language model–driven post-ASR processing on text-only ASR outputs. It investigates transcription correction, speaker tagging, and emotion recognition by leveraging N-best ASR hypotheses and instruction-based prompting, without acoustic inputs. The paper provides baseline results across diverse datasets, discusses potential biases and reliability concerns, and outlines lessons for designing future evaluations. By standardizing datasets and metrics, GenSEC aims to catalyze research on LLM-enabled voice interfaces and motivate future multimodal extensions.

Abstract

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
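
As a rough illustration of the text-only setup described above, the following minimal sketch formats N-best ASR hypotheses into an instruction-style prompt for Task 1. The function name `build_correction_prompt`, the instruction wording, and the example hypotheses are assumptions made for illustration; they are not the challenge's official prompt format or data.

```python
from typing import List


def build_correction_prompt(nbest: List[str]) -> str:
    """Format N-best ASR hypotheses into an instruction-style prompt.

    Hypothetical helper: the wording below is illustrative only and is
    not taken from the GenSEC baselines.
    """
    if not nbest:
        raise ValueError("need at least one hypothesis")
    # Number the hypotheses in rank order (1 = top-scoring hypothesis).
    lines = [f"{rank}. {hyp.strip()}" for rank, hyp in enumerate(nbest, start=1)]
    instruction = (
        "Below are the N-best hypotheses from a speech recognizer, "
        "ranked best first. Report the most likely true transcription."
    )
    return instruction + "\n\n" + "\n".join(lines) + "\n\nTranscription:"


# Made-up hypotheses; no audio is involved at any point.
hypotheses = [
    "i red the book last night",
    "i read the book last night",
    "i read the buck last night",
]
prompt = build_correction_prompt(hypotheses)
print(prompt)
```

The resulting string could be passed to any open pretrained language model or agent-based API; the model sees only text, which matches the challenge's constraint of working without acoustic inputs.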

Paper Structure (10 sections, 5 figures, 4 tables)

This paper contains 10 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The framework for LLM postprocessing of "text representation of speech" via ASR-decoded information, for three tasks: speech recognition, speaker diarization, and emotion recognition.
  • Figure 2: Example Task 1 approach: post-speech recognition error correction with different techniques on LLMs.
  • Figure 3: Example Task 2 approach based on beam-search decoding for speaker tagging [park2023enhancing].
  • Figure 4: Dataflow for the Task 2 baseline. Note that the acoustic-only diarization probability values are set to fixed values.
  • Figure 5: Example of a data entry of Task 3.