Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization

A F M Saif, Xiaodong Cui, Han Shen, Songtao Lu, Brian Kingsbury, Tianyi Chen

TL;DR

The paper addresses data scarcity and negative transfer in ASR by reframing training as a bilevel optimization problem that jointly learns unsupervised representations and supervised recognition. BL-JUST uses a penalty-based reformulation to couple the lower-level InfoNCE-based unsupervised objective with the upper-level CTC-based supervised objective, so that the two objectives inform each other within a single training loop rather than in separate stages. Under standard smoothness and Polyak-Łojasiewicz (PL) conditions, the method converges to stationary points with an iteration complexity of $\mathcal{O}(L_{\gamma}\epsilon^{-1})$. Empirical results on LibriSpeech and TED-LIUM v2 show that BL-JUST consistently outperforms both the conventional pre-training followed by fine-tuning (PT+FT) strategy and supervised baselines while also reducing training time, which makes it practical for ASR with limited labeled data.

Abstract

In this paper, we present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks, which we term bi-level joint unsupervised and supervised training (BL-JUST). BL-JUST employs a lower-level and an upper-level optimization with an unsupervised loss and a supervised loss, respectively, leveraging recent advances in penalty-based bilevel optimization to solve this challenging ASR problem with affordable complexity and rigorous convergence guarantees. To evaluate BL-JUST, we conducted extensive experiments on the LibriSpeech and TED-LIUM v2 datasets. BL-JUST achieves superior performance over the commonly used strategy of pre-training followed by fine-tuning.
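
To make the penalty-based idea concrete, the following is a minimal sketch on a toy problem, not the BL-JUST algorithm itself. It assumes the common penalized form F(x, y) = f(x, y) + gamma * g(x), where g stands in for the lower-level (unsupervised) loss with minimum value 0 and f stands in for the upper-level (supervised) loss. The quadratic losses, the constant penalty gamma, the step size, and all function names are hypothetical choices made for illustration; the paper uses InfoNCE and CTC losses on a neural acoustic model.

```python
# Toy penalty-based bilevel descent (illustrative assumptions throughout).
# x: shared "encoder" parameters (2-D), y: "head" parameter (scalar).

def grad_g(x):
    # Lower-level loss g(x) = 0.5 * (x0 + x1 - 2)^2; its minimizers form
    # the line x0 + x1 = 2, so the lower level alone does not pin down x.
    s = x[0] + x[1] - 2.0
    return [s, s]

def grad_f(x, y):
    # Upper-level loss f(x, y) = 0.5*(x0 - 3)^2 + 0.5*x1^2 + 0.5*(y - x0)^2.
    r = y - x[0]
    grad_x = [(x[0] - 3.0) - r, x[1]]
    grad_y = r
    return grad_x, grad_y

def penalized_descent(gamma=50.0, steps=20000):
    # Plain gradient descent on F(x, y) = f(x, y) + gamma * g(x).
    x, y = [0.0, 0.0], 0.0
    # Step size 1/L, with L = 2*gamma + 4 an upper bound on the curvature
    # of F for this particular toy problem.
    lr = 1.0 / (2.0 * gamma + 4.0)
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        gg = grad_g(x)
        # Both levels contribute to every update of the shared parameters.
        x = [x[i] - lr * (gx[i] + gamma * gg[i]) for i in range(2)]
        y = y - lr * gy
    return x, y
```

Calling `penalized_descent()` returns a point that nearly satisfies the lower-level optimality condition x0 + x1 = 2 while also minimizing the upper-level loss along that line. For this toy problem the remaining violation x0 + x1 - 2 at the stationary point equals 1 / (1 + 2 * gamma), so a larger gamma tightens the lower-level constraint at the cost of a smaller admissible step size.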

Paper Structure (12 sections, 2 theorems, 9 equations, 2 figures, 6 tables, 1 algorithm)

Key Result

Lemma 1

Under Assumption (QGEB), with a prescribed accuracy $\delta>0$, set $\gamma \geq L\sqrt{3\mu\delta^{-1}}$. If $(x_\gamma,y_\gamma)$ is a local/global solution of the penalized problem, it is also a local/global solution of the following approximate version of the original problem with some $\epsilon$ …

Figures (2)

  • Figure 1: Comparison of the proposed BL-JUST training method (bottom) with the PT+FT method (top).
  • Figure 2: Training losses of BL-JUST vs. PT+FT on 100 hours of speech from LibriSpeech. The acoustic model is a Conformer.

Theorems & Definitions (3)

  • Remark 1: Two-stage training versus BL-JUST training
  • Lemma 1: Equivalence of the penalized formulation
  • Theorem 1: Convergence rate of BL-JUST