Combining X-Vectors and Bayesian Batch Active Learning: Two-Stage Active Learning Pipeline for Speech Recognition
Ognjen Kundacina, Vladimir Vincan, Dragisa Miskovic
TL;DR
This work tackles the labeling bottleneck in end-to-end ASR with a two-stage active learning (AL) pipeline. The first stage builds a diverse labeled seed set by clustering x-vector embeddings with DBSCAN, which gives broad coverage of speaker and recording-condition variation. The second stage iteratively improves the model with a batch Bayesian AL strategy that uses Monte Carlo (MC) dropout to estimate uncertainty, expressed in terms of word error rate (WER), and draws informative samples from across the x-vector clusters so that each batch stays diverse. The base ASR model is wav2vec 2.0 XLS-R. Empirical results show strong gains on underrepresented speaker groups and robust out-of-distribution (OOD) performance on VoxPopuli, while remaining competitive on standard benchmarks. Combining diversity-driven seed selection with uncertainty-driven refinement therefore reduces labeling needs without sacrificing accuracy.
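The first-stage idea of spreading the seed set over x-vector clusters can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cluster labels have already been computed (for example by DBSCAN, where scikit-learn marks noise points with -1), and the function name `select_diverse_seed`, the round-robin policy, and the handling of noise points as one extra group are all assumptions made for this example.

```python
import random
from collections import defaultdict


def select_diverse_seed(cluster_labels, seed_size, rng=None):
    """Pick up to `seed_size` utterance indices, cycling over clusters.

    cluster_labels[i] is the cluster id of utterance i. The label -1
    (DBSCAN noise) is treated as one more group, which is an assumption
    of this sketch rather than something stated in the paper.
    """
    rng = rng or random.Random(0)

    # Group utterance indices by their cluster id.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    # Shuffle inside each cluster so the pick within a cluster is random.
    pools = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        pools.append(members)

    # Round-robin: take one utterance per cluster per pass until the
    # budget is spent or every cluster is exhausted.
    selected = []
    while len(selected) < seed_size and any(pools):
        for pool in pools:
            if pool and len(selected) < seed_size:
                selected.append(pool.pop())
    return selected


labels = [0, 0, 0, 0, 1, 1, -1, 2]
seed = select_diverse_seed(labels, seed_size=4)
```

With four groups and a budget of four, the seed contains exactly one utterance from each group, so small clusters are represented even though cluster 0 dominates the pool.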
Abstract
This paper introduces a novel two-stage active learning (AL) pipeline for automatic speech recognition (ASR) that combines unsupervised and supervised AL methods. The first stage applies unsupervised AL: it clusters x-vectors to select diverse samples from unlabeled speech data, establishing a robust initial dataset for the subsequent supervised AL. The second stage applies a supervised batch AL method developed specifically for ASR, which aims to select batches of samples that are both diverse and informative. Here, sample diversity is again achieved through x-vector clustering, while the most informative samples are identified using a Bayesian AL method tailored for ASR, which adapts Monte Carlo dropout to approximate Bayesian inference. This approach enables precise uncertainty estimation, allowing the ASR model to be trained with significantly less labeled data. Our method outperforms competing methods on homogeneous, heterogeneous, and out-of-distribution (OOD) test sets, demonstrating that strategic sample selection and Bayesian modeling can substantially reduce labeling effort and improve data utilization in deep learning-based ASR applications.
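The second-stage selection described above can be illustrated with a short sketch. It is written under stated assumptions and is not the authors' exact procedure: uncertainty is taken here as the mean word error rate between one deterministic transcript and several transcripts decoded with dropout left active, and the batch is filled by taking the most uncertain remaining utterance from each x-vector cluster in turn. The function names and the specific scoring rule are hypothetical, and the transcripts are assumed to come from an ASR model that is not shown.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def mc_dropout_uncertainty(deterministic, stochastic):
    """Mean WER of dropout-on transcripts against the dropout-off one.

    Assumption of this sketch: disagreement between stochastic passes and
    the deterministic pass is used as the uncertainty score.
    """
    scores = [word_error_rate(deterministic, s) for s in stochastic]
    return sum(scores) / len(scores)


def select_batch(uncertainties, cluster_labels, batch_size):
    """Take the most uncertain utterances, one cluster at a time."""
    by_cluster = {}
    for idx, (score, label) in enumerate(zip(uncertainties, cluster_labels)):
        by_cluster.setdefault(label, []).append((score, idx))
    # Within each cluster, order candidates by descending uncertainty.
    pools = [sorted(members, reverse=True)
             for _, members in sorted(by_cluster.items())]

    batch, rank = [], 0
    while len(batch) < batch_size and any(rank < len(p) for p in pools):
        for pool in pools:
            if rank < len(pool) and len(batch) < batch_size:
                batch.append(pool[rank][1])
        rank += 1
    return batch


score = mc_dropout_uncertainty("the cat sat", ["the cat sat", "the hat sat"])
batch = select_batch([0.1, 0.9, 0.5, 0.7], [0, 0, 1, 1], batch_size=2)
```

In this toy run the batch holds the most uncertain utterance from each of the two clusters (indices 1 and 3) rather than the two highest scores overall, which is how the sketch trades off informativeness against diversity.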
