Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Yonathan Ron; Shiri Gilboa; Tammuz Dubnov

TL;DR

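A multi-agent LLM pipeline that generates compact prompts for Whisper's decoder reduces WER on NBA commentary by 17.0% relative (from 0.217 to 0.180) without retraining, showing that prompt-based augmentation can deliver scalable domain adaptation for ASR and offers a practical alternative to costly model fine-tuning.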

Abstract

Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder in a second pass. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology), our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). WER improves in 40.1% of segments and degrades in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
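To make the headline metric concrete, the following is a minimal sketch of how WER and a relative WER reduction can be computed. It assumes lowercasing and whitespace tokenization, which may differ from the normalization the authors actually used; the function names and the example sentences are hypothetical and not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: lowercase + whitespace tokenization; the paper's exact
    text normalization is not specified in this section.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical example: one substituted word out of four.
example_wer = word_error_rate("curry hits the three", "carry hits the three")

# Using the rounded WER values reported in the abstract.
reported = relative_reduction(0.217, 0.180)
```

With the rounded endpoints from the abstract, `reported` comes out at roughly 0.17, in line with the stated 17.0% relative reduction; the small difference in the last digit is presumably due to rounding of the reported WER values.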

Paper Structure (23 sections, 1 equation, 2 figures, 1 table)


Introduction
Related Work
Problem Analysis and Dataset
NBA Commentary Dataset
Error Taxonomy
Methodology
Multi-Agent Pipeline Architecture
Whisper Initial Prompt Mechanism
Pipeline Variants
Why a natural-sentence prompt?
Implementation Details
Experimental Evaluation
Evaluation Framework
Results
Error Analysis and Success Cases
...and 8 more sections

Figures (2)

Figure 1: Original Whisper decoding format (adapted from the official documentation), annotated to highlight where the initial_prompt (red circle, previous text tokens) and prefix tokens (red boxes, time-aligned text tokens) enter the model’s input stream. In our work, we leverage the initial prompt channel to inject domain-specific context, enabling Whisper to bias its decoding toward correct player names and basketball jargon.
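As a rough illustration of the initial-prompt channel described in the Figure 1 caption, the sketch below runs two decoding passes and feeds a generated prompt into the second one. It assumes the open-source `openai-whisper` Python package, whose `transcribe` method accepts an `initial_prompt` argument; the helper names `two_pass_transcribe` and `whisper_transcriber`, the model size, and the callable-based structure are illustrative choices, not the authors' implementation.

```python
from typing import Callable, Optional, Tuple

# A transcriber takes (audio_path, initial_prompt) and returns text.
Transcriber = Callable[[str, Optional[str]], str]


def two_pass_transcribe(
    audio_path: str,
    transcribe: Transcriber,
    build_prompt: Callable[[str], str],
) -> Tuple[str, str, str]:
    """First pass without context, second pass biased by a generated prompt."""
    first_pass = transcribe(audio_path, None)
    prompt = build_prompt(first_pass)          # e.g. an LLM-driven context generator
    second_pass = transcribe(audio_path, prompt)
    return first_pass, prompt, second_pass


def whisper_transcriber(model_name: str = "base") -> Transcriber:
    """Wrap an openai-whisper model as a Transcriber.

    Assumption: the `openai-whisper` package is installed. The import is
    deferred so the rest of the sketch can be used without it.
    """
    import whisper

    model = whisper.load_model(model_name)

    def transcribe(audio_path: str, initial_prompt: Optional[str]) -> str:
        result = model.transcribe(audio_path, initial_prompt=initial_prompt)
        return result["text"]

    return transcribe
```

A call might look like `two_pass_transcribe("clip.wav", whisper_transcriber(), my_prompt_builder)`, where `clip.wav` and `my_prompt_builder` are placeholders. Passing the transcriber as a callable keeps the two-pass logic independent of any particular ASR backend.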
Figure 2: Full multi-agent pipeline flow (P4). Multiple agents extract candidate context from the first-pass transcript. Decider modules filter low-confidence or unnecessary items, and a sentence builder composes a final concise prompt for Whisper’s second pass.
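The flow in the Figure 2 caption (agents propose context, deciders filter it, a sentence builder composes the prompt) can be sketched as follows. This is a toy outline under stated assumptions: the agents here are hard-coded stubs standing in for LLM calls, and the `Candidate` type, the confidence threshold, the term cap, and the sentence template are all hypothetical rather than details from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    term: str
    kind: str          # e.g. "player" or "jargon" (hypothetical labels)
    confidence: float  # hypothetical score in [0, 1]


Agent = Callable[[str], List[Candidate]]


def decide(candidates: List[Candidate], threshold: float = 0.5) -> List[Candidate]:
    """Decider stand-in: drop low-confidence items and duplicate terms."""
    seen = set()
    kept = []
    for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        key = cand.term.lower()
        if cand.confidence >= threshold and key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept


def build_sentence(topic: str, kept: List[Candidate], max_terms: int = 8) -> str:
    """Sentence-builder stand-in: compose one short natural-language prompt."""
    terms = [cand.term for cand in kept[:max_terms]]
    if not terms:
        return f"This is {topic}."
    return f"This is {topic} featuring {', '.join(terms)}."


def generate_prompt(transcript: str, topic: str, agents: List[Agent],
                    threshold: float = 0.5) -> str:
    candidates = [cand for agent in agents for cand in agent(transcript)]
    return build_sentence(topic, decide(candidates, threshold))


# Stub agents with made-up outputs; a real pipeline would query an LLM here.
def name_agent(transcript: str) -> List[Candidate]:
    return [Candidate("Stephen Curry", "player", 0.9),
            Candidate("Staff Curry", "player", 0.2)]


def jargon_agent(transcript: str) -> List[Candidate]:
    return [Candidate("pick and roll", "jargon", 0.7)]


example_prompt = generate_prompt(
    "staff curry with the pick and roll",
    "NBA basketball commentary",
    [name_agent, jargon_agent],
)
```

In this toy run the low-confidence "Staff Curry" candidate is filtered out, so the second-pass prompt carries only the two retained terms. Keeping the decider separate from the agents makes it easy to tune how aggressively context is pruned without touching the extraction step.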
