Towards Unsupervised Speech Recognition Without Pronunciation Models
Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo
TL;DR
This work tackles word-level unsupervised automatic speech recognition (ASR) without a pronunciation lexicon by introducing joint speech-text token-infilling (JSTTI). The authors develop an iterative boundary refinement pipeline that combines word-level speech representations built from HuBERT features with a Transformer-based JSTTI model, aided by differentiable boundary pooling and pseudo-text self-training. On a curated English speech corpus with a fixed vocabulary, JSTTI achieves word error rates of roughly 20% to 23%, depending on vocabulary size, and outperforms prior lexicon-free baselines. The results extend to larger vocabularies through careful initialization and boundary refinement. The findings point to a viable path toward pronunciation-model-free ASR in low-resource settings and provide a framework for cross-modal, word-level unsupervised learning and evaluation.
Abstract
Recent advances in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to train these systems effectively. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora, and we further remove the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. The model outperforms previous unsupervised ASR models in the lexicon-free setting.
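For readers unfamiliar with the metric, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming whitespace tokenization; the function name `word_error_rate` is illustrative and is not taken from the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("a b c d e", "a x c d e"))
```

Under this definition, a word error rate of 20% corresponds to one word-level error per five reference words on average.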
