Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

TL;DR

The paper addresses domain adaptation of ASR under limited labeled data and computational resources by introducing a data-selection pipeline that filters pseudo-labels generated by Whisper and Zipformer. It combines WER prediction, NER-based selection, and CER-based filtering to curate a small, high-quality subset of pseudo-labeled data for fine-tuning. Empirical results on Wow and Fisher English show that 1–5% of the pseudo-labeled data, chosen with CER-based or multi-criteria selection, can match or surpass full-dataset fine-tuning, enabling efficient domain adaptation. The approach is practical for real-world deployments where annotation and compute budgets are constrained, and it adapts to evolving acoustic and lexical properties to maintain accuracy.

Abstract

Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
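The CER-based filtering idea can be illustrated with a small sketch. The abstract only states that the CER-based approach relies on hypotheses from three ASR systems; the exact selection rule is not given here. The version below therefore assumes one plausible rule: keep a segment when the mean pairwise character error rate between its hypotheses is below a threshold. The function names and the `max_cer=0.05` threshold are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` measured against `reference`."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def mean_pairwise_cer(hypotheses: list[str]) -> float:
    """Average CER over all pairs; the first item of each pair acts as reference."""
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        raise ValueError("need at least two hypotheses per segment")
    return sum(cer(a, b) for a, b in pairs) / len(pairs)


def select_segments(segments: dict[str, list[str]], max_cer: float = 0.05) -> list[str]:
    """Keep segment ids whose ASR hypotheses agree closely (assumed rule)."""
    return [seg_id for seg_id, hyps in segments.items()
            if mean_pairwise_cer(hyps) <= max_cer]


# Toy data: one segment where three systems agree, one where they diverge.
segments = {
    "seg_a": ["thank you for calling"] * 3,
    "seg_b": ["please hold the line", "police told the lion", "please fold a line"],
}
kept = select_segments(segments)
```

The intuition behind such a rule is that when independently trained systems produce nearly identical transcripts, the pseudo-label is more likely to be correct, so agreement serves as a label-free proxy for transcript quality.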

Paper Structure

This paper contains 13 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The functional blocks of the ASR domain adaptation process: (1) Pseudo-label (PL) generation using Whisper and Zipformer, where the corresponding ASR model is fine-tuned with the generated pseudo-labels. (2) Data selection pipeline consisting of: (a) random selection, (b) WER prediction with an SVM using HuBERT and XLM-R, (c) NER-based selection using a BERT model, and (d) CER-based filtering, where NeMo's Parakeet model is employed alongside Whisper and Zipformer to select data. ASR fine-tuning is carried out on (3) the full training set using the generated pseudo-labels and (4) the filtered data.
  • Figure 2: NER Entity Class Distribution with Confidence Levels. Each bar represents a different filtering method: Total Available Data (leftmost) shows the overall entity class distribution in grayscale, segmented by confidence levels (low, mid, high). Random Selection (100 hrs) maintains the same confidence distribution but selects segments randomly. High Confidence (100 hrs) prioritizes segments with the highest NER confidence scores ($> 0.8$). Entity Class Balanced Selection (rightmost two bars) ensures proportional representation of entity classes while choosing segments either randomly or with high confidence.
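
The entity-class-balanced, high-confidence selection described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes each segment carries a single dominant entity class and one NER confidence score, and it splits the hour budget across classes in proportion to each class's share of the total duration. The `Segment` type and the function name are hypothetical; the 0.8 confidence cutoff is taken from the caption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: str
    hours: float
    entity_class: str   # assumed: one dominant entity class per segment
    confidence: float   # assumed: one NER confidence score per segment


def balanced_high_confidence(segments, budget_hours, min_conf=0.8):
    """Pick high-confidence segments while keeping class proportions."""
    by_class = defaultdict(list)
    for seg in segments:
        by_class[seg.entity_class].append(seg)

    total_hours = sum(seg.hours for seg in segments)
    selected = []
    for items in by_class.values():
        # Each class receives a share of the budget proportional to its duration.
        class_budget = budget_hours * sum(s.hours for s in items) / total_hours
        used = 0.0
        for seg in sorted(items, key=lambda s: s.confidence, reverse=True):
            if seg.confidence <= min_conf:
                break  # remaining segments are all at or below the cutoff
            if used + seg.hours <= class_budget + 1e-9:
                selected.append(seg)
                used += seg.hours
    return selected


# Toy pool: 4 hours of PERSON segments and 2 hours of ORG segments.
pool = [
    Segment("p1", 1.0, "PERSON", 0.95),
    Segment("p2", 1.0, "PERSON", 0.90),
    Segment("p3", 1.0, "PERSON", 0.85),
    Segment("p4", 1.0, "PERSON", 0.50),
    Segment("o1", 1.0, "ORG", 0.99),
    Segment("o2", 1.0, "ORG", 0.70),
]
chosen = balanced_high_confidence(pool, budget_hours=3.0)
chosen_ids = sorted(seg.seg_id for seg in chosen)
```

With a 3-hour budget, the PERSON class is allotted 2 hours and the ORG class 1 hour, so the selection preserves the 2:1 class ratio of the pool while preferring the most confident segments within each class.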