ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

Zefang Liu; Chenyang Zhu; Sangwoo Cho; Shi-Xiong Zhang

TL;DR

ReHear is a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop, allowing it to recover phonetically accurate transcripts even from severe recognition errors.

Abstract

Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
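The iterative cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `rehear_loop`, `asr`, `refine`, and `fine_tune` are hypothetical, the three components are passed in as plain callables, and details such as how labeled data is mixed in, how many rounds are run, or whether pseudo-labels are filtered are not stated here and are left out.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases; an "audio" is represented by an opaque identifier.
Audio = str
Asr = Callable[[Audio], str]                 # audio -> hypothesis
Refiner = Callable[[Audio, str], str]        # (audio, hypothesis) -> refined transcript
FineTuner = Callable[[Asr, List[Tuple[Audio, str]]], Asr]


def rehear_loop(
    asr: Asr,
    refine: Refiner,
    fine_tune: FineTuner,
    unlabeled: Sequence[Audio],
    rounds: int,
) -> Asr:
    """Sketch of iterative pseudo-label refinement (assumed structure)."""
    for _ in range(rounds):
        pseudo_labeled: List[Tuple[Audio, str]] = []
        for audio in unlabeled:
            # 1. The current ASR model produces a hypothesis.
            hypothesis = asr(audio)
            # 2. The refiner sees both the audio and the hypothesis,
            #    unlike a text-only corrector.
            refined = refine(audio, hypothesis)
            pseudo_labeled.append((audio, refined))
        # 3. The refined pseudo-labels become fine-tuning targets,
        #    and the updated model is used in the next round.
        asr = fine_tune(asr, pseudo_labeled)
    return asr
```

In a real system, `refine` would wrap an instruction-tuned audio LLM and `fine_tune` would run gradient updates. Passing them as callables keeps the sketch independent of any particular model or training library.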

Paper Structure (14 sections, 1 figure, 5 tables, 1 algorithm)

Introduction
Related Work
Methodology
Experiments
Datasets
Data Preparation
Prompt Templates
Experimental Setup
Experimental Results
Ablation Studies
Modality and Position
Decoding Strategies
Iterative Dynamics
Conclusion

Figures (1)

Figure 1: Overview of the proposed ReHear framework. An audio-aware LLM refines ASR hypotheses using multimodal context (the hypothesis and the source audio), and the refined pseudo-labels are then used to iteratively fine-tune the ASR model.
