Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
TL;DR
This work tackles domain-adaptive ASR when labeled in-domain data are scarce by combining an incremental semi-supervised learning (SSL) pipeline with pseudo-label filtering and auxiliary related-domain data. The method starts from a small labeled seed set, optionally augments it with related-domain data, and iteratively adds filtered pseudo-labeled utterances drawn from the unlabeled pool, using CER-based consensus or NER-based filtering to mitigate label noise. On both Wow and Fisher English, CER-based consensus filtering yields the largest gains, with incremental SSL outperforming single-step fine-tuning and approaching manual-label performance in some cases; NER filtering offers competitive results at lower computational cost. The findings show that targeted pseudo-label selection combined with incremental retraining can substantially improve multi-domain ASR when extensive labeled data are unavailable.
Abstract
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce, but unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over using no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, and performance saturates more slowly than with random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, the pipeline outperforms single-step fine-tuning. Consensus-based filtering outperforms the other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
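To make the consensus idea concrete, the following is a minimal sketch of CER-based pseudo-label filtering, not the authors' implementation. It assumes two ASR models have each transcribed the same unlabeled utterances, and it keeps an utterance only when the character error rate (CER) between the two hypotheses is at or below a threshold. The function names, the choice of the first model's hypothesis as the reference, and the default threshold `max_cer=0.1` are all illustrative assumptions; this section does not specify them.

```python
from typing import Dict, List, Tuple


def char_edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref (edit distance / len(ref))."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return char_edit_distance(ref, hyp) / len(ref)


def consensus_filter(
    hyps_a: Dict[str, str],
    hyps_b: Dict[str, str],
    max_cer: float = 0.1,  # hypothetical threshold, chosen for illustration
) -> List[Tuple[str, str]]:
    """Keep utterances whose two model hypotheses agree closely.

    Model A's hypothesis is treated as the reference and is the one
    returned as the pseudo-label (an assumption of this sketch).
    """
    kept = []
    for utt_id, text_a in hyps_a.items():
        text_b = hyps_b.get(utt_id)
        if text_b is None:
            continue  # no second hypothesis, so no consensus can be measured
        if cer(text_a, text_b) <= max_cer:
            kept.append((utt_id, text_a))
    return kept


# Toy hypotheses from two hypothetical models.
model_a = {"u1": "hello world", "u2": "call me back", "u3": "thank you"}
model_b = {"u1": "hello world", "u2": "fall we bark", "u3": "thank yo"}
selected = consensus_filter(model_a, model_b)
```

In an incremental setup, a filter like this would be re-run after each retraining round, so that the improved model produces new hypotheses and the selected set can grow as agreement between models increases.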
