Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition

Xiaodong Cui, A F M Saif, Songtao Lu, Lisha Chen, Tianyi Chen, Brian Kingsbury, George Saon

TL;DR

BL-JUST addresses the disconnect between unsupervised pre-training and supervised fine-tuning in ASR by casting training as a bilevel optimization problem: the upper level minimizes the supervised loss, while the lower level minimizes the unsupervised loss. The problem is solved with penalty-based bilevel gradient descent (PBGD), which gradually increases a penalty so that the lower-level solution acts as a constraint, encouraging matched local optima of the two competing objectives. On LibriSpeech, Switchboard, and Payload, BL-JUST consistently outperforms pre-training followed by fine-tuning (PT+FT) and other semi-supervised baselines, and ablations show the importance of self-supervised exploration and final fine-tuning. The approach also accelerates convergence of the unsupervised loss and yields improvements across architectures and loss families, offering a practical path to more data-efficient ASR.

Abstract

In this paper, we propose a bilevel joint unsupervised and supervised training (BL-JUST) framework for automatic speech recognition. Compared to the conventional pre-training and fine-tuning strategy, which is a disconnected two-stage process, BL-JUST optimizes an acoustic model so that it simultaneously minimizes both the unsupervised and supervised loss functions. Because BL-JUST seeks matched local optima of both loss functions, the acoustic representations learned by the model strike a good balance between being generic and task-specific. We solve the BL-JUST problem using penalty-based bilevel gradient descent and evaluate the trained deep neural network acoustic models on various datasets with a variety of architectures and loss functions. We show that BL-JUST can outperform the widely used pre-training and fine-tuning strategy and some other popular semi-supervised techniques.
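
The penalty idea described above can be illustrated with a minimal sketch on a toy problem. This is not the paper's algorithm or model. The two quadratic losses, the function names (`sup_loss`, `unsup_loss`, `pbgd`), the linear penalty ramp, and the step-size rule are all assumptions chosen for illustration. The toy unsupervised loss is zero everywhere on the line t1 + t2 = 2, so the lower level has many solutions. The toy supervised loss selects among them.

```python
def sup_loss(theta):
    """Toy 'supervised' (upper-level) loss and its gradient (hypothetical)."""
    t1, t2 = theta
    value = t1 ** 2 + (t2 - 3.0) ** 2
    grad = (2.0 * t1, 2.0 * (t2 - 3.0))
    return value, grad


def unsup_loss(theta):
    """Toy 'unsupervised' (lower-level) loss; zero on the line t1 + t2 = 2."""
    t1, t2 = theta
    s = t1 + t2 - 2.0
    return s * s, (2.0 * s, 2.0 * s)


def pbgd(theta0, steps=5000, gamma_max=100.0):
    """Gradient descent on sup_loss + gamma * unsup_loss with a growing gamma."""
    theta = tuple(theta0)
    ramp = steps // 2  # assumed schedule: ramp gamma linearly, then hold it
    for k in range(steps):
        gamma = gamma_max * min(1.0, (k + 1) / ramp)
        # Step size 1/L, where L = 2 + 4*gamma is the largest curvature of
        # this particular quadratic objective (specific to the toy problem).
        lr = 1.0 / (2.0 + 4.0 * gamma)
        _, grad_sup = sup_loss(theta)
        _, grad_unsup = unsup_loss(theta)
        theta = tuple(
            t - lr * (gs + gamma * gu)
            for t, gs, gu in zip(theta, grad_sup, grad_unsup)
        )
    return theta


if __name__ == "__main__":
    theta = pbgd((0.0, 0.0))
    print("theta:", theta)
    print("supervised loss:", sup_loss(theta)[0])
    print("unsupervised loss:", unsup_loss(theta)[0])
```

In this toy setting, the unconstrained minimizer of the supervised loss is (0, 3), which does not minimize the unsupervised loss. As the penalty grows, the iterate instead approaches (-0.5, 2.5), the point on the line t1 + t2 = 2 with the smallest supervised loss. An actual acoustic model would replace these quadratics with network losses and stochastic gradients.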

Paper Structure

This paper contains 17 sections, 2 theorems, 12 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

Under the above assumptions, with a prescribed accuracy $\delta > 0$, set $\gamma \geq L\sqrt{3\mu\delta^{-1}}$. If $(\theta_\gamma,\phi_\gamma,\eta_\gamma)$ is a local/global solution of Eq. eqn:psingle, it is also a local/global solution of the following approximate problem of Eq. eqn:blp with

Figures (3)

  • Figure 1: An illustration of the two-stage pre-training followed by fine-tuning (PT+FT) in the upper panel and bilevel joint unsupervised and supervised training (BL-JUST) in the lower panel.
  • Figure 2: The network architecture for bilevel joint unsupervised and supervised training.
  • Figure 3: The unsupervised CPC loss (upper panel) and supervised CTC loss (lower panel) of PT+FT and BL-JUST on LibriSpeech using 860 hours of unlabeled data and 100 hours of labeled data (L/U: 100/860).

Theorems & Definitions (2)

  • Lemma 1
  • Theorem 1: Convergence rate of BL-JUST