Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Simon Malan, Benjamin van Niekerk, Herman Kamper

TL;DR

Unsupervised word discovery is addressed via a two-stage pipeline that first detects word boundaries from the dissimilarity $f_t = d(\mathbf{y}_{t+1}, \mathbf{y}_{t})$ on HuBERT features, then builds a lexicon by clustering averaged segment embeddings in $\mathbb{R}^M$. The approach is benchmarked against dynamic-programming baselines, including an updated ES-KMeans+ variant, on five languages from ZeroSpeech Track 2, showing competitive performance with a substantial speedup (approximately 4–5x). The study highlights the critical roles of boundary quality, self-supervised feature choice, and language-specific pre-training for effective lexicon learning in unsupervised speech. These insights support scalable, high-performance word discovery in low-resource settings.

Abstract

We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster. Project webpage: https://s-malan.github.io/prom-seg-clus.
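
The boundary-detection step can be illustrated with a small sketch. It is not the authors' implementation: it assumes cosine distance for $d(\cdot,\cdot)$, a triangular smoothing window, and a plain height threshold on local maxima in place of a true peak-prominence criterion. The function names, `window`, and `threshold` are hypothetical, and pure Python is used only to keep the example self-contained.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b); an assumed choice for d(., .)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 0.0


def dissimilarity_curve(frames):
    """f_t = d(y_{t+1}, y_t) for every pair of adjacent frames."""
    return [cosine_distance(frames[t + 1], frames[t])
            for t in range(len(frames) - 1)]


def smooth(curve, window=3):
    """Triangular weighted average; weights are renormalised at the edges."""
    half = window // 2
    out = []
    for t in range(len(curve)):
        total, weight_sum = 0.0, 0.0
        for k in range(-half, half + 1):
            if 0 <= t + k < len(curve):
                w = half + 1 - abs(k)
                total += w * curve[t + k]
                weight_sum += w
        out.append(total / weight_sum)
    return out


def pick_peaks(curve, threshold=0.1):
    """Local maxima at or above a height threshold.

    Simplification: a height threshold stands in for peak prominence.
    """
    return [t for t in range(1, len(curve) - 1)
            if curve[t] > curve[t - 1]
            and curve[t] >= curve[t + 1]
            and curve[t] >= threshold]


def detect_boundaries(frames, window=3, threshold=0.1):
    """Return the index of the first frame of each new segment."""
    curve = smooth(dissimilarity_curve(frames), window)
    # A peak at position t lies between frame t and frame t + 1.
    return [t + 1 for t in pick_peaks(curve, threshold)]
```

For example, a toy sequence of four identical frames followed by four frames orthogonal to them produces a single spike in the dissimilarity curve, so `detect_boundaries` places one boundary at the change point.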

Paper Structure

This paper contains 9 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An example of word boundaries from the prominence-based approach of ankita_tti. The red (dark) line is the dissimilarity curve between adjacent frames, which is smoothed to produce the white line. The crosses are the predicted boundaries. The black vertical lines are the ground truth boundaries.
  • Figure 2: Our lexicon building step. After extracting frame-level features (a), PCA dimensionality reduction is applied (b). For each segment from the prominence-based approach (Fig. 1), an averaged embedding is obtained (c). These are $K$-means clustered (d) to get a lexicon.
  • Figure 3: Normalized edit distance (%) of our prominence-based approach when swapping English HuBERT features for multilingual HuBERT (mHuBERT) features.
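
The lexicon-building step in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the PCA step (b) is omitted for brevity, $K$-means is a plain Lloyd's-algorithm loop with random initialisation from the data, and all function names, `iterations`, and `seed` are hypothetical. In practice a numerical library would replace the pure-Python loops.

```python
import random


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def segment_embeddings(frames, boundaries):
    """Average the frames between consecutive boundaries (step c)."""
    edges = [0] + sorted(boundaries) + [len(frames)]
    return [mean_vector(frames[s:e])
            for s, e in zip(edges, edges[1:]) if e > s]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm (step d); returns one cluster label per point."""
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k),
                      key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: recompute each centroid from its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_vector(members)
    return labels
```

Each cluster label then acts as a word type in the lexicon: segments that share a label are treated as instances of the same word-like unit.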