Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM

Jeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke

TL;DR

This work tackles the challenge of generating high-quality pseudo-labels for unlabeled audio by replacing traditional multi-stage ensemble pipelines with unified, prompt-driven architectures built on large language models (LLMs). It compares three approaches: a classic multi-ASR ensemble, postprocessing by a textual LLM, and postprocessing by a speechLLM that also incorporates acoustic evidence. The study reports that textual LLM postprocessing improves transcription accuracy over the ensemble and that speechLLM postprocessing yields further gains, especially on domain-adapted data, reaching performance near, or in some settings better than, that obtained with ground-truth labels. By casting pseudo-label generation and error correction as instruction-following tasks and using parameter-efficient finetuning (e.g., QLoRA and adapters), the authors obtain a simpler, more robust semi-supervised ASR training pipeline with strong cross-domain performance.

Abstract

Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speechLLM transcriptions compared to baselines.
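
To make the prompt-driven idea concrete, the following is a minimal sketch of how the hypotheses from several ASR systems might be packed into a single instruction-following prompt for a textual LLM. The instruction wording, the `ASR i:` labels, and the function name `build_fusion_prompt` are illustrative assumptions, not the prompt or code used by the authors. A speechLLM variant would additionally receive the audio itself, which is not shown here.

```python
from typing import Sequence

# Hypothetical instruction text; the paper's actual prompt is not given here.
INSTRUCTION = (
    "Below are transcripts of the same utterance produced by different ASR "
    "systems. Combine them into a single corrected transcript."
)


def build_fusion_prompt(
    hypotheses: Sequence[str], instruction: str = INSTRUCTION
) -> str:
    """Format the outputs of several ASR systems into one LLM prompt.

    The LLM is then expected to act as the arbiter, taking the place of
    voting or other hand-written reconciliation logic.
    """
    if not hypotheses:
        raise ValueError("need at least one ASR hypothesis")
    lines = [instruction, ""]
    # One labeled line per ASR system, with stray whitespace removed.
    for i, hyp in enumerate(hypotheses, start=1):
        lines.append(f"ASR {i}: {hyp.strip()}")
    lines.append("")
    lines.append("Corrected transcript:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up hypotheses, for illustration only.
    print(
        build_fusion_prompt(
            [
                "i would like to cancel my order",
                "i would like to council my order",
                "i'd like to cancel my order",
            ]
        )
    )
```

The resulting string would be sent to whichever finetuned LLM is in use, and the generated continuation taken as the pseudo-label. Keeping the fusion logic in the prompt rather than in code is what lets a single model handle both hypothesis selection and error correction.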

Paper Structure

This paper contains 15 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of different approaches for generating pseudo-labels: (a) multi-ASR ensemble pipeline, (b) multi-ASR textual LLM-based architecture, (c) multi-ASR speechLLM-based architecture.
  • Figure 2: Multi-ASR ensemble pipeline for pseudo-labeling.