Revisiting speech segmentation and lexicon learning with better features

Herman Kamper, Benjamin van Niekerk

TL;DR

The paper tackles zero-resource speech segmentation and lexicon learning from unlabelled audio. It builds on the two-stage duration-penalised dynamic programming (DPDP) pipeline, using HuBERT features for acoustic unit discovery and a DPDP-compatible autoencoding RNN (AE-RNN) scorer for word-like segmentation, and then clusters acoustic word embeddings with K-means to form a lexicon. Across five languages in ZeroSpeech Track 2, the approach achieves the best lexicon quality as measured by normalised edit distance (NED), together with competitive boundary-based segmentation scores. The results suggest that strong self-supervised representations combined with simple clustering can yield high-quality lexicons in zero-resource settings.
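
To make the NED metric concrete, here is a minimal sketch. It assumes the common definition: the Levenshtein distance between the phone transcriptions of two segments, divided by the length of the longer one, and averaged over all pairs of segments that share a cluster. The function names are hypothetical, and the official ZeroSpeech evaluation toolkit may differ in details such as how pairs are selected and weighted.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (e.g. lists of phone labels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence (assumed definition)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest


def mean_cluster_ned(clusters):
    """Average NED over all within-cluster pairs; `clusters` is a list of lists of transcriptions."""
    scores = [normalised_edit_distance(a, b)
              for members in clusters
              for a, b in combinations(members, 2)]
    return sum(scores) / len(scores) if scores else 0.0
```

A lower value means that segments assigned to the same lexicon entry have more similar phone transcriptions; 0 indicates that every cluster contains only identical transcriptions.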

Abstract

We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT. After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features. These embeddings are clustered using K-means to get a lexicon. The result is good full-coverage segmentation with a lexicon that achieves state-of-the-art performance on the ZeroSpeech benchmarks.
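
The lexicon-building step described in the abstract (average the frame-level features within each segment, then cluster the averages with K-means) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, the boundary format (exclusive end-frame indices), and the plain Lloyd's-algorithm K-means are assumptions, and random arrays stand in for real HuBERT features.

```python
import numpy as np


def average_embeddings(features, boundaries):
    """Mean-pool frame features within each segment.

    features:   (T, D) array of frame-level features (HuBERT in the paper).
    boundaries: increasing exclusive end-frame indices, last one equal to T
                (hypothetical format chosen for this sketch).
    Returns an (N, D) array with one acoustic word embedding per segment.
    """
    starts = [0] + list(boundaries[:-1])
    return np.stack([features[s:e].mean(axis=0)
                     for s, e in zip(starts, boundaries)])


def kmeans(x, k, n_iters=50, seed=0):
    """Plain Lloyd's K-means; returns (labels, centroids)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), centroids


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 16))    # stand-in for HuBERT frame features
    ends = [12, 30, 41, 58, 77, 100]       # stand-in word boundaries
    embeddings = average_embeddings(frames, ends)
    labels, _ = kmeans(embeddings, k=3)
    print(labels)  # one lexicon entry (cluster index) per segment
```

Each cluster index acts as a lexicon entry, so segments with the same label are treated as instances of the same word type. In practice a library implementation such as scikit-learn's `KMeans` would likely be preferred at scale.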

Paper Structure (5 sections, 1 equation, 1 figure, 1 table)


Figures (1)

  • Figure 1: Starting from kamper2023word, we replace CPC with a HuBERT clustering model for acoustic unit discovery (a). After word segmentation (b), we construct a lexicon through K-means clustering (d) on averaged HuBERT acoustic word embeddings (c).