Towards Unsupervised Speech Recognition at the Syllable-Level

Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass

TL;DR

SylCipher addresses unsupervised speech recognition without G2P by modeling at the syllable level and applying information-constrained masked language modeling. It probes a unified encoder architecture with differentiable syllabification, entropy control, and distribution matching (including PUSM), achieving state-of-the-art results in G2P-free UASR across LibriSpeech, SpokenCOCO, and AISHELL-3, with notable improvements in Mandarin. The approach yields up to a 40\% relative reduction in character error rate (CER) on LibriSpeech and strong cross-domain generalization, demonstrating the practical viability of syllable-level units for language-universal UASR and robust boundary detection. The work also provides theoretical guarantees for distribution alignment and zero-error UASR under regularity assumptions, offering a promising direction for inclusive spoken-language technology without linguistic resource bottlenecks.

Abstract

Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40\% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
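
To make the headline metric concrete, the following is a minimal sketch of how a character error rate and a relative CER reduction are typically computed. The function names and the example error rates are illustrative assumptions, not values or code from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - new) / baseline


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# a baseline CER of 0.20 lowered to 0.12 is a 40% relative reduction.
print(round(relative_reduction(0.20, 0.12), 2))
```

Note that a relative reduction is measured against the baseline error rate, so a 40\% relative reduction is not the same as lowering CER by 40 absolute points.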

Paper Structure

This paper contains 25 sections, 3 theorems, 29 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Suppose $(f_{\tilde{X}}^*, g_{\tilde{X}}^*, f_Y^*, g_Y^*)$ minimize equation eq:regularized_distribution_matching, then under certain assumptions (See Appendix app:proof_of_main), $f_{{\tilde{X}}}^*$ and $f_Y^*$ are invertible and $q_{Y|X}^*(y|x):=\mathbbm{1}[f_Y^{*-1}\circ f_{\tilde{X}}^*\circ c \c

Figures (5)

  • Figure 1: Overall architecture of SylCipher. Gray boxes are fixed during training. (a) MLM-based stages: learn a compressed joint semantic space with a shared encoder and random mix-up. (b) PUSM stage: align speech and text spaces by matching lower-order marginals of their distributions.
  • Figure 2: Speech syllabifier
  • Figure 3: Ablation studies on the effect of syllabifier type, pooling type, and token vocabulary size on SylCipher UASR performance. (a) Effect of different poolers, named after the type of $\sigma_{\epsilon}$ function used in equation \ref{eq:softpooler}. (b) Effect of the vocabulary size of non-<OOV> tokens on SylCipher performance during the fixed-boundary stage on LibriSpeech (matched). (c) Effect of syllabifiers with different resource requirements on SylCipher performance during the fixed-boundary stage on LibriSpeech (matched).
  • Figure 4: Spectrograms of audio examples in our test split of LibriSpeech clean subsets (matched setting) and the predicted speech-text alignment by SylCipher after different training stages. Audios are truncated to the 3-second mark for better visualization. The alignment bars from top to bottom: Forced alignment, Sylber, Sylber+JE2E, Sylber+JE2E+PUSM.
  • Figure 5: Spectrograms of audio examples in our test split of LibriSpeech clean subsets (matched setting) and the predicted speech-text alignment by SylCipher after different training stages. Audios are truncated to the 3-second mark for better visualization. The alignment bars from top to bottom: Forced alignment, Sylber, Sylber+JE2E, Sylber+JE2E+PUSM.
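
The Figure 1 caption describes the PUSM stage as matching lower-order marginals of the speech and text distributions. As a rough illustration of what "matching lower-order marginals" can mean, the sketch below compares unigram and skip-gram statistics of two sets of discrete token sequences. It is a simplified, non-differentiable toy under assumed names (`unigram_marginal`, `skipgram_marginal`, `l1_distance`) and is not the paper's training objective.

```python
from collections import Counter


def unigram_marginal(sequences):
    """Empirical distribution over single tokens."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


def skipgram_marginal(sequences, skip=1):
    """Empirical distribution over token pairs that are `skip` positions apart."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - skip):
            counts[(seq[i], seq[i + skip])] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}


def l1_distance(p, q):
    """L1 distance between two sparse distributions stored as dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy data: unpaired "text" tokens and predicted "speech" tokens.
text = [["ni", "hao", "ma"], ["ni", "hao"]]
speech_pred = [["ni", "hao"], ["ni", "hao", "ma"]]

# The sequences are unpaired and differently ordered, yet both marginals agree.
print(l1_distance(unigram_marginal(text), unigram_marginal(speech_pred)))
print(l1_distance(skipgram_marginal(text), skipgram_marginal(speech_pred)))
```

The point of the toy is that such statistics can be compared without any sentence-level pairing between speech and text, which is the setting UASR operates in.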

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • proof