Towards Unsupervised Speech Recognition Without Pronunciation Models

Junrui Ni; Liming Wang; Yang Zhang; Kaizhi Qian; Heting Gao; Mark Hasegawa-Johnson; Chang D. Yoo

TL;DR

This work tackles unsupervised automatic speech recognition at the word level without pronunciation lexicons by introducing joint speech-text token-infilling (JSTTI). The authors develop an iterative boundary refinement pipeline that combines word-level speech representations from HuBERT-based features with a Transformer-based JSTTI model, aided by differentiable boundary pooling and pseudo-text self-training. On synthetic LibriSpeech-like data with fixed vocabularies, JSTTI achieves competitive word error rates (around 20-23%) and outperforms prior lexicon-free baselines, with results extending to larger vocabularies through careful initialization and boundary refinement. The findings demonstrate a viable path toward pronunciation-model-free ASR in low-resource settings and provide a framework for cross-modal, word-level unsupervised learning and evaluation.

Abstract

Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by removing the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. The model outperforms previous unsupervised ASR models in the lexicon-free setting.

Paper Structure (33 sections, 30 equations, 13 figures, 15 tables, 5 algorithms)

Introduction
Background
Unsupervised Speech Recognition
wav2vec-U and REBORN
Position Unigram and Skipgram Matching
Speech Representation Learning
HuBERT
VG-HuBERT
SpeechT5
Unsupervised Word Segmentation for Speech
GradSeg
XLS-R Fine-tuning on Noisy Word Boundaries
Segmental-CPC and Segmental PUSM
Model Architecture, Training, and Refinement
Joint Speech-Text Token-infilling with Transformer
...and 18 more sections

Figures (13)

Figure 1: Illustration of Joint Speech-Text Token Infilling during training. The encoder-only model is shared between speech-to-speech token infilling (left branch) and text-to-text token infilling (right branch). If $N$ consecutive tokens are masked on the input side, they are replaced with $N$ $<\!\text{MASK}\!>$ tokens (denoted as [M] in the figure) or with samples randomly drawn from the modality-specific vocabulary. The word-level speech token extraction process is detailed in Section \ref{subsec:extract_word_level_features}.
Figure 2: Whole-word ASR inference process after JSTTI training. The encoder takes speech tokens as input and predicts a text token for each token representation.
Figure 3: Word-level speech token extraction. The three-step speech feature extraction routine converts continuous speech into discrete, word-level tokens. Step 3.1 only uses word-level features extracted from VG-HuBERT for training the k-means quantizer, even if the segmentation model $S$ used in Step 2 is not VG-HuBERT.
Figure 4: Behavior-cloning stage of the CNN segmenter. We train a simple CNN classifier that takes frame-level speech features as input and jointly predicts the frame-level GradSeg + wav2bnd unsupervised word boundary targets and the frame-level acoustic cluster labels as outputs. In the figure, CE stands for cross-entropy loss and BCE stands for binary cross-entropy loss.
Figure 5: Illustrative example of the differentiable soft-pooler. This example shows how the differentiable soft-pooler obtains word-level features from the boundary probability predictions of the CNN segmenter and how the speech discrete cluster sequence is obtained in a differentiable manner. The variables in the figure are referenced in the description. The symbol "$\sim$" means "a real number close to," e.g., "$\sim2$" denotes a real number between 1.5 and 2.5. Note that the CNN segmenter is trainable, while the differentiable k-means quantizer is not.
...and 8 more figures
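
The input corruption described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_span`, the `replace_prob` parameter, and the choice to pick one corruption mode per span are all assumptions. The only property taken from the caption is that a span of $N$ tokens is replaced by exactly $N$ tokens, either mask tokens or random draws from the same modality's vocabulary.

```python
import random

MASK = "<MASK>"  # placeholder mask symbol; shown as [M] in Figure 1


def mask_span(tokens, start, length, vocab, replace_prob=0.5, rng=None):
    """Corrupt tokens[start:start+length] for token infilling.

    Hypothetical helper: the span is replaced either by `length` MASK
    tokens or by `length` tokens drawn at random from `vocab` (the
    modality-specific vocabulary). Choosing the mode once per span and
    the `replace_prob` knob are assumptions, not details from the paper.
    Returns (corrupted_sequence, original_span_tokens).
    """
    rng = rng or random.Random()
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("span falls outside the token sequence")

    corrupted = list(tokens)
    use_random = rng.random() < replace_prob
    for i in range(start, start + length):
        # The sequence length is preserved: one replacement per masked token.
        corrupted[i] = rng.choice(vocab) if use_random else MASK

    targets = list(tokens[start:start + length])
    return corrupted, targets
```

For example, `mask_span(["a", "b", "c", "d"], 1, 2, vocab=["a", "b", "c", "d"], replace_prob=0.0)` returns `(["a", "<MASK>", "<MASK>", "d"], ["b", "c"])`. The same routine would apply to either branch by passing speech cluster IDs or text word tokens together with the matching vocabulary.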
