Combining X-Vectors and Bayesian Batch Active Learning: Two-Stage Active Learning Pipeline for Speech Recognition

Ognjen Kundacina; Vladimir Vincan; Dragisa Miskovic

TL;DR

This work tackles the labeling bottleneck in end-to-end ASR with a two-stage active learning (AL) pipeline. The first stage builds a diverse labeled seed set by clustering x-vector embeddings with DBSCAN; the second stage iteratively improves the model with a batch Bayesian AL strategy that uses Monte Carlo (MC) dropout and a WER-based measure to quantify uncertainty. The base ASR model is wav2vec 2.0 XLS-R. The first stage ensures broad coverage of speaker and condition variations, and the second stage selects informative, diverse samples from across x-vector clusters. Empirical results show strong gains on underrepresented speaker groups and robust out-of-distribution (OOD) performance on VoxPopuli, while remaining competitive on standard benchmarks. Overall, combining diversity-driven seed selection with uncertainty-driven refinement reduces labeling needs without sacrificing accuracy.
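To make the WER-based uncertainty idea concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes that one transcript is decoded with dropout disabled, several more are decoded with dropout enabled, and the uncertainty of an utterance is the mean WER of the stochastic transcripts measured against the deterministic one. The function names `wer` and `mc_dropout_uncertainty` are hypothetical, and the transcripts are taken as plain strings rather than produced by a real ASR model.

```python
from statistics import mean


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of `hypothesis` measured against `reference`."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Assumption: an empty reference scores 0 if the hypothesis is also empty.
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def mc_dropout_uncertainty(deterministic_hyp, stochastic_hyps):
    """Assumed uncertainty: mean WER of dropout-enabled transcripts
    against the dropout-free transcript of the same utterance."""
    return mean(wer(deterministic_hyp, h) for h in stochastic_hyps)
```

Under this assumption, an utterance whose stochastic transcripts all agree with the deterministic one receives an uncertainty of zero, and disagreement raises the score in proportion to the number of differing words.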

Abstract

This paper introduces a novel two-stage active learning (AL) pipeline for automatic speech recognition (ASR), combining unsupervised and supervised AL methods. The first stage applies unsupervised AL, using x-vectors clustering to select diverse samples from unlabeled speech data, thus establishing a robust initial dataset for the subsequent supervised AL. The second stage incorporates a supervised AL strategy, with a batch AL method specifically developed for ASR, aimed at selecting diverse and informative batches of samples. Here, sample diversity is also achieved using x-vectors clustering, while the most informative samples are identified using a Bayesian AL method tailored for ASR with an adaptation of Monte Carlo dropout to approximate Bayesian inference. This approach enables precise uncertainty estimation, thereby enhancing ASR model training with significantly reduced data requirements. Our method has shown superior performance compared to competing methods on homogeneous, heterogeneous, and out-of-distribution (OOD) test sets, demonstrating that strategic sample selection and innovative Bayesian modeling can substantially optimize both labeling effort and data utilization in deep learning-based ASR applications.
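The idea of selecting batches that are both diverse and informative can be sketched as follows. This is an illustrative assumption, not the authors' algorithm: it takes precomputed per-sample uncertainties and x-vector cluster labels (for example from DBSCAN, where the label -1 marks noise and is treated here as its own group) and picks samples round-robin across clusters, most uncertain first. The function name `select_batch` and its parameters are hypothetical.

```python
from collections import defaultdict


def select_batch(uncertainties, cluster_labels, batch_size):
    """Return up to `batch_size` sample indices, visiting clusters in turn
    and taking the most uncertain remaining sample from each."""
    by_cluster = defaultdict(list)
    for idx, (u, c) in enumerate(zip(uncertainties, cluster_labels)):
        by_cluster[c].append((u, idx))

    # One queue per cluster, ordered by descending uncertainty
    # (ties broken by the lower sample index).
    queues = [sorted(items, key=lambda t: (-t[0], t[1]))
              for _, items in sorted(by_cluster.items())]

    batch, rank = [], 0
    while len(batch) < batch_size and any(rank < len(q) for q in queues):
        for q in queues:
            if rank < len(q) and len(batch) < batch_size:
                batch.append(q[rank][1])
        rank += 1
    return batch


# Three samples in cluster 0 and two in cluster 1.
print(select_batch([0.9, 0.8, 0.1, 0.7, 0.2], [0, 0, 0, 1, 1], 3))  # [0, 3, 1]
```

Round-robin selection is only one way to trade diversity against informativeness; a pure top-k ranking by uncertainty would instead risk drawing the whole batch from a single cluster.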

Paper Structure (14 sections, 18 equations, 7 figures, 4 tables, 2 algorithms)

Introduction
Theoretical Preliminaries
  Automatic Speech Recognition
  X-Vectors
  Deep Active Learning
Proposed Approach
  First Stage: Unsupervised Active Learning
  Second Stage: Supervised Active Learning
Results and Discussion
  First Stage - Unsupervised Active Learning
  Second Stage - Supervised Active Learning
  Results on OOD Test Set
  Results on a Standard ASR Evaluation Benchmark
Conclusion

Figures (7)

Figure 1: The proposed two-stage active learning pipeline. The first, unsupervised active learning stage is based on x-vectors clustering. The second, supervised active learning stage combines x-vector-based batch active learning with Bayesian active learning via Monte Carlo dropout approximation.
Figure 2: The upper plot displays x-vectors and the lower plot shows i-vectors of speech recordings from two speakers, both reduced to two dimensions using PCA.
Figure 3: The relationship between WER and the proposed uncertainty on test set samples. The red line illustrates a linear fit to the provided relationship.
Figure 4: Primary test set WER (%) for trained ASR models over AL iterations, comparing the proposed approach, SMCA, random sampling, isolated first stage, isolated second stage, and the whole dataset baseline. Each AL iteration adds approximately 1 hour of labeled training data, with the whole dataset totaling 17.31 hours.
Figure 5: The distribution of uncertainties calculated using the proposed approach for all unlabeled samples in each AL iteration.
...and 2 more figures
