Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization

A F M Saif; Xiaodong Cui; Han Shen; Songtao Lu; Brian Kingsbury; Tianyi Chen

TL;DR

The paper addresses data scarcity and negative transfer in ASR by reframing training as a bilevel optimization problem that jointly learns unsupervised representations and supervised recognition. BL-JUST uses a penalty-based reformulation to couple the lower-level InfoNCE-based unsupervised objective with the upper-level CTC-based supervised objective, enabling feedback between stages within a single training loop. Under standard smoothness and PL conditions, the method achieves convergence to stationary points, with an iteration complexity of $\mathcal{O}(L_{\gamma}\epsilon^{-1})$. Empirical results on LibriSpeech and TED-LIUM v2 show that BL-JUST consistently outperforms the conventional PT+FT strategy and supervised baselines, while also reducing training time, demonstrating practical benefits for ASR with limited labeled data.
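The penalty-based coupling described above can be sketched in equations. The notation is an assumption made for illustration and is not taken from the paper: $\theta$ denotes parameters shared by both objectives, $\phi$ denotes parameters used only by the supervised objective, $g$ is the unsupervised (InfoNCE-based) loss, $f$ is the supervised (CTC-based) loss, and $\gamma>0$ is the penalty constant.

$$
\min_{\theta,\phi}\; f(\theta,\phi) \quad \text{s.t.} \quad \theta \in \arg\min_{\theta'} g(\theta')
$$

$$
\min_{\theta,\phi}\; f(\theta,\phi) + \gamma\,\Big(g(\theta) - \min_{\theta'} g(\theta')\Big)
$$

Under this reading, the subtracted minimum does not depend on $(\theta,\phi)$, so a gradient step on the penalized problem would use $\nabla f + \gamma \nabla g$, which is how both losses could be updated in a single training loop.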

Abstract

In this paper, we present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks, which we term bi-level joint unsupervised and supervised training (BL-JUST). BL-JUST employs a lower-level and an upper-level optimization with an unsupervised loss and a supervised loss, respectively, leveraging recent advances in penalty-based bilevel optimization to solve this challenging ASR problem with affordable complexity and rigorous convergence guarantees. To evaluate BL-JUST, we conducted extensive experiments on the LibriSpeech and TED-LIUM v2 datasets. BL-JUST achieves superior performance over the commonly used strategy of pre-training followed by fine-tuning.
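To make the idea of penalty-based bilevel training concrete, here is a minimal sketch on a toy problem. It is not the authors' algorithm or code: the two quadratic losses, the fixed penalty constant `gamma`, the step size, and all function names are hypothetical choices made for illustration. The lower-level loss stands in for the unsupervised objective and the upper-level loss for the supervised one.

```python
# Toy penalty-based bilevel descent (illustrative assumptions throughout).
# Lower level (stand-in for the unsupervised loss): g(theta) = 0.5 * (theta - 1)^2
# Upper level (stand-in for the supervised loss):
#   f(theta, phi) = 0.5 * (theta - 3)^2 + 0.5 * (phi - theta)^2
# Penalized objective: f(theta, phi) + gamma * g(theta)


def lower_grad(theta):
    """Gradient of g with respect to theta."""
    return theta - 1.0


def upper_grads(theta, phi):
    """Gradients of f with respect to theta and phi."""
    d_theta = (theta - 3.0) - (phi - theta)
    d_phi = phi - theta
    return d_theta, d_phi


def penalized_descent(gamma, steps=5000):
    """Run plain gradient descent on f + gamma * g from the origin."""
    theta, phi = 0.0, 0.0
    # Step size chosen below 1 / (largest curvature), which is at most gamma + 3 here.
    lr = 1.0 / (gamma + 3.0)
    for _ in range(steps):
        f_theta, f_phi = upper_grads(theta, phi)
        # Both gradients are evaluated at the current point, then applied together.
        theta_next = theta - lr * (f_theta + gamma * lower_grad(theta))
        phi_next = phi - lr * f_phi
        theta, phi = theta_next, phi_next
    return theta, phi


if __name__ == "__main__":
    theta, phi = penalized_descent(gamma=100.0)
    print(theta, phi)
```

In this toy setting the exact bilevel solution is `theta = phi = 1`, while the penalized minimizer is `theta = phi = (3 + gamma) / (1 + gamma)`. A larger `gamma` pulls the penalized solution closer to the lower-level minimizer, at the cost of a smaller stable step size.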

Paper Structure (12 sections, 2 theorems, 9 equations, 2 figures, 6 tables, 1 algorithm)

Introduction
Problem Formulation
Bilevel optimization preliminaries
Bilevel optimization for acoustic model training
Joint Unsupervised and Supervised Training
Training
Convergence
Experiments
Experimental Setting
ASR Performance
Effect of penalty constant
Conclusions

Key Result

Lemma 1

Under the QGEB assumption, with a prescribed accuracy $\delta>0$, set $\gamma \geq L\sqrt{3\mu\delta^{-1}}$. If $(x_\gamma,y_\gamma)$ is a local/global solution of the penalized problem, it is also a local/global solution of the following approximate problem of the original problem with some $\epsil
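The penalty threshold in the lemma is simple arithmetic, and a small helper shows how it scales with the target accuracy. The function name and the example values are hypothetical. `L` and `mu` are the constants that appear in the bound, and their precise definitions are not given in this excerpt.

```python
import math


def min_penalty_constant(L, mu, delta):
    """Smallest gamma satisfying gamma >= L * sqrt(3 * mu / delta)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return L * math.sqrt(3.0 * mu / delta)


if __name__ == "__main__":
    # Example values are arbitrary and chosen only to illustrate the scaling.
    for delta in (1.0, 0.25, 0.01):
        print(delta, min_penalty_constant(L=1.0, mu=3.0, delta=delta))
```

Because the bound grows as $\delta^{-1/2}$, reducing $\delta$ by a factor of four doubles the required penalty constant.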

Figures (2)

Figure 1: Comparison of the proposed BL-JUST training method (bottom) with the PT+FT method (top).
Figure 2: Training losses of BL-JUST vs. PT+FT on 100 hours of speech from LibriSpeech. The acoustic model is a Conformer.

Theorems & Definitions (3)

Remark 1: Two-stage training versus BL-JUST training
Lemma 1: Equivalence of the penalized formulation
Theorem 1: Convergence rate of BL-JUST
