A Simple HMM with Self-Supervised Representations for Phone Segmentation

Gene-Ping Yang, Hao Tang

TL;DR

Peak detection on Mel spectrograms is a strong baseline for phone segmentation, better than many self-supervised approaches. Building on this, the paper proposes a simple hidden Markov model that uses self-supervised representations and features at the boundaries.

Abstract

Despite recent advances in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations.

Paper Structure

This paper contains 12 sections, 7 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Peak detection using Mel spectrogram on the sample utterance fadg0_sx289 from TIMIT. From top to bottom: Mel spectrogram, spectral variations, and ground truth phone segments.
  • Figure 2: Comparison of the detected boundaries by different HMMs using HuBERT features on fadg0_si1909 from TIMIT.
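
The pipeline that Figure 1 depicts (Mel spectrogram, then spectral variation, then peaks as candidate boundaries) can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean frame-to-frame distance used as "spectral variation", the strict local-maximum rule, the `threshold` parameter, and all function names are assumptions made for illustration, and the input is any precomputed frame matrix of shape (T, D) rather than a real Mel spectrogram.

```python
import numpy as np


def spectral_variation(frames):
    """Distance between consecutive frames; returns an array of shape (T-1,).

    Assumption: Euclidean distance stands in for the paper's notion of
    spectral variation, which is not defined in this summary.
    """
    frames = np.asarray(frames, dtype=float)
    return np.linalg.norm(frames[1:] - frames[:-1], axis=1)


def detect_peaks(variation, threshold=0.0):
    """Indices of strict local maxima of `variation` that exceed `threshold`."""
    v = np.asarray(variation, dtype=float)
    if v.size < 3:
        return np.array([], dtype=int)
    mid = v[1:-1]
    is_peak = (mid > v[:-2]) & (mid > v[2:]) & (mid > threshold)
    # +1 because `mid` starts at index 1 of `v`.
    return np.nonzero(is_peak)[0] + 1


def boundaries_from_frames(frames, threshold=0.0):
    """Hypothetical helper: frame indices where a new segment is predicted to start.

    variation[i] measures the change between frame i and frame i + 1,
    so a peak at i places the boundary at frame i + 1.
    """
    return detect_peaks(spectral_variation(frames), threshold) + 1


if __name__ == "__main__":
    # Toy input: three flat "segments" with one-dimensional features.
    frames = np.array([[0.0]] * 4 + [[5.0]] * 4 + [[1.0]] * 3)
    print(boundaries_from_frames(frames))
```

On real speech the variation curve is noisy, so a practical version would likely smooth the curve or require a minimum peak prominence before accepting a boundary; those choices are left out of this sketch.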