Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming
Simon Malan, Benjamin van Niekerk, Herman Kamper
TL;DR
Unsupervised word discovery is addressed with a two-stage pipeline: word boundaries are first detected from the dissimilarity $f_t = d(\mathbf{y}_{t+1}, \mathbf{y}_{t})$ between adjacent HuBERT feature frames, and a lexicon is then built by clustering averaged segment embeddings in $\mathbb{R}^M$. The approach is benchmarked against dynamic-programming baselines, including an updated ES-KMeans+ variant, on the five languages of ZeroSpeech Track 2, where it performs competitively while running almost five times faster. The study highlights how strongly lexicon quality depends on boundary quality, the choice of self-supervised features, and language-specific pre-training. These findings point toward scalable word discovery in low-resource settings.
Abstract
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach achieves state-of-the-art results similar to those of the new ES-KMeans+ method, while being almost five times faster. Project webpage: https://s-malan.github.io/prom-seg-clus.
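The two stages described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance $d$ is taken to be cosine distance, boundaries are picked as local maxima of $f_t$ above a fixed `threshold` (a hypothetical parameter), a small hand-written K-means stands in for the clustering step, and synthetic one-hot vectors stand in for HuBERT features. All function names are illustrative.

```python
import numpy as np

def cosine_dissimilarity(features):
    """f_t = 1 - cos(y_{t+1}, y_t) for adjacent frames; returns shape (T-1,)."""
    a, b = features[:-1], features[1:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return 1.0 - num / den

def detect_boundaries(features, threshold=0.5):
    """Assumed peak rule: a boundary is a local maximum of f_t above threshold."""
    f = cosine_dissimilarity(features)
    boundaries = []
    for t in range(len(f)):
        left = f[t - 1] if t > 0 else -np.inf
        right = f[t + 1] if t < len(f) - 1 else -np.inf
        if f[t] >= threshold and f[t] > left and f[t] >= right:
            boundaries.append(t + 1)  # boundary sits before frame t + 1
    return boundaries

def segment_embeddings(features, boundaries):
    """Average the frames of each segment into one M-dimensional vector."""
    edges = [0] + list(boundaries) + [len(features)]
    return np.stack([features[s:e].mean(axis=0)
                     for s, e in zip(edges[:-1], edges[1:])])

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(d)])
    centroids = np.stack(centroids).astype(float)
    for _ in range(iters):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# Synthetic "utterance": word type A (5 frames), type B (4 frames), type A (6 frames).
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
features = np.vstack([np.tile(a, (5, 1)), np.tile(b, (4, 1)), np.tile(a, (6, 1))])

boundaries = detect_boundaries(features)               # frame indices of boundaries
embeddings = segment_embeddings(features, boundaries)  # one vector per segment
labels = kmeans(embeddings, k=2)                       # lexicon entry per segment
print(boundaries, labels)
```

On this toy input the two changes of word type are recovered as boundaries, and the first and third segments fall into the same cluster. Unlike a dynamic-programming search, nothing here scores alternative segmentations, which is consistent with the speed advantage reported above.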
