Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

Zijian Yang; Jörg Barkoczi; Ralf Schlüter; Hermann Ney

TL;DR

The paper develops a theoretical framework for unsupervised speech recognition grounded in classification error bounds, derives such a bound, and, motivated by it, proposes a single-stage sequence-level cross-entropy loss.

Abstract

Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible, and we also discuss the necessity of these conditions. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.

Paper Structure (10 sections, 3 theorems, 24 equations, 1 figure)

Introduction & Related Work
Classification Error Mismatch in ASR
Unsupervised Speech Recognition
Problem Statement
Sufficient Conditions for Unsupervised Training
Sequence-Level Unsupervised Training Criterion
Discussions
Necessity of the Full-Column Rank Condition
Necessity of the Structure Assumption
Conclusion

Key Result

Theorem 1

When $\mathbf{P}_C$ has full column rank and the true distribution satisfies the structure assumption, the following inequality holds: where $\mathbf{P}_C^+ = (\mathbf{P}_C^\top \mathbf{P}_C)^{-1}\mathbf{P}_C^\top$ is the left inverse of $\mathbf{P}_C$, and $\|\mathbf{P}_C^{+}\|_1$ is the induced $\ell_1$ norm of $\mathbf{P}_C^{+}$.
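The two quantities named in the theorem can be computed directly. The following is a minimal sketch, not the authors' code: the helper names `left_inverse` and `induced_l1_norm` and the example matrix are hypothetical, and treating $\mathbf{P}_C$ as a $4 \times 3$ column-stochastic matrix is an assumption made only for illustration.

```python
import numpy as np


def left_inverse(P):
    """Return (P^T P)^{-1} P^T for a matrix P with full column rank."""
    P = np.asarray(P, dtype=float)
    if np.linalg.matrix_rank(P) < P.shape[1]:
        raise ValueError("P must have full column rank")
    # Solve (P^T P) X = P^T instead of forming the inverse explicitly.
    return np.linalg.solve(P.T @ P, P.T)


def induced_l1_norm(A):
    """Induced l1 norm of a matrix: the maximum absolute column sum."""
    A = np.asarray(A, dtype=float)
    return float(np.abs(A).sum(axis=0).max())


# Hypothetical 4x3 matrix whose columns each sum to 1 (an assumption,
# chosen only to have a concrete full-column-rank example).
P_C = np.array([
    [0.7, 0.1, 0.1],
    [0.1, 0.7, 0.1],
    [0.1, 0.1, 0.7],
    [0.1, 0.1, 0.1],
])

P_plus = left_inverse(P_C)
norm_P_plus = induced_l1_norm(P_plus)
```

For a matrix with full column rank, this left inverse coincides with the Moore-Penrose pseudoinverse, so `np.linalg.pinv(P_C)` and `np.linalg.norm(P_plus, 1)` give the same results.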

Figures (1)

Figure 1: Simulation result for the sequence-level bound relating the distance between sequence-level marginal distributions, $\sum_{x_1^N}|pr(x_1^N) - q(x_1^N)|$, to $\overline{D}_q$. The simulation uses $|\mathcal{X}|=4$, $|\mathcal{C}|=3$, $N=3$, and $\|\mathbf{P}_C^+\|_1 \leq 2$. $\mathbf{P}_C$ is guaranteed to have full column rank by requiring that its minimum singular value be larger than 0.01. The grey dots are simulation points.
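The acceptance conditions stated in the caption can be sketched as a rejection sampler. This is an illustrative sketch under stated assumptions rather than the authors' simulation code: the function names `accepts` and `sample_matrix` are hypothetical, and drawing each column of $\mathbf{P}_C$ from a uniform Dirichlet distribution is an assumption, since the caption does not say how the matrices were generated.

```python
import numpy as np


def accepts(P, min_sv=0.01, max_norm=2.0):
    """Check the two conditions from the caption for a candidate matrix P."""
    P = np.asarray(P, dtype=float)
    # Full column rank is enforced via a lower bound on the singular values.
    smallest_sv = np.linalg.svd(P, compute_uv=False).min()
    if smallest_sv <= min_sv:
        return False
    # Induced l1 norm of the pseudoinverse (maximum absolute column sum).
    P_plus = np.linalg.pinv(P)
    return bool(np.linalg.norm(P_plus, 1) <= max_norm)


def sample_matrix(rng, n_obs=4, n_cls=3, max_tries=10000):
    """Draw column-stochastic matrices until one is accepted.

    Assumption: columns are sampled from a uniform Dirichlet distribution.
    Returns None if no candidate is accepted within max_tries.
    """
    for _ in range(max_tries):
        # dirichlet(..., size=n_cls) has shape (n_cls, n_obs); transpose so
        # that each column is a distribution over the n_obs symbols.
        P = rng.dirichlet(np.ones(n_obs), size=n_cls).T
        if accepts(P):
            return P
    return None


rng = np.random.default_rng(0)
P_C = sample_matrix(rng)
```

Because the norm constraint rejects poorly conditioned matrices, the sampler may need many draws; the `max_tries` cap keeps the loop bounded.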

Theorems & Definitions (3)

Theorem 1
Lemma 1
Lemma 2