Unsupervised Speech Segmentation: A General Approach Using Speech Language Models

Avishai Elmakies, Omri Abend, Yossi Adi

TL;DR

This work tackles unsupervised speech segmentation by shifting focus from spectral boundaries to acoustic-semantic style changes, such as emotion and gender. It introduces a three-stage pipeline (sentencer, scorer, span selector) that leverages Speech Language Models to discretize speech into tokens and compute a PMI-based boundary score, enabling flexible span selection (C(k), A(v), T(t)). Evaluations on EmoV-DB and IEMOCAP with synthetic concatenations show improvements in boundary detection and reduced over-segmentation compared with baselines, though some metrics like PC-F1 favor simpler baselines under certain biases. The approach advances unsupervised, multi-style segmentation and has potential impacts on hierarchical speech models and spoken language understanding, while acknowledging limitations in inference time and dataset scope.

Abstract

In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: The proposed pipeline. The speech utterance is first segmented into $m$ segments ("acoustic-sentences") by the sentencer; the SLM then scores consecutive segments, and the span selector selects the ones to merge.
  • Figure 2: Ablation results. Subfigures present the effect of different ablations on PR-F1 and R-Val. All results use the 350M model and are shown on EmoV-DB (E represents the "change of emotion" experiment and G represents the "change of gender" experiment). PC-F1 has a high positive correlation (0.55) with PR-F1 and is therefore omitted for brevity.
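
The pipeline in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the real system scores segments with a Speech Language Model over discretized speech tokens, whereas `toy_log_prob` below is a hypothetical stand-in (a smoothed unigram model conditioned on the context) used only so the example runs. The PMI form `log p(y | x) - log p(y)` is the standard definition. Reading C(k) as "keep the k lowest-scoring gaps" and T(t) as "keep gaps scoring below t" is an assumption made for illustration, and A(v) is not sketched.

```python
import math
from typing import Callable, List, Sequence

Tokens = Sequence[int]
# Hypothetical scorer interface: log p(target | context).
LogProbFn = Callable[[Tokens, Tokens], float]

VOCAB_SIZE = 4  # assumed toy vocabulary of discrete speech tokens


def toy_log_prob(target: Tokens, context: Tokens) -> float:
    """Stand-in for an SLM: add-one smoothed unigram counts from the context."""
    total = len(context) + VOCAB_SIZE
    return sum(math.log((list(context).count(tok) + 1) / total) for tok in target)


def pmi_scores(segments: List[Tokens], log_prob: LogProbFn) -> List[float]:
    """PMI for each gap between consecutive segments: log p(y|x) - log p(y)."""
    return [
        log_prob(nxt, prev) - log_prob(nxt, ())
        for prev, nxt in zip(segments, segments[1:])
    ]


def select_constant(scores: List[float], k: int) -> List[int]:
    """Assumed reading of C(k): the k gaps with the lowest PMI become boundaries."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])


def select_threshold(scores: List[float], t: float) -> List[int]:
    """Assumed reading of T(t): every gap with PMI below t becomes a boundary."""
    return [i for i, s in enumerate(scores) if s < t]


def merge_segments(segments: List[Tokens], boundaries: List[int]) -> List[List[int]]:
    """Merge consecutive segments, cutting after each segment index in `boundaries`."""
    cut_after = set(boundaries)
    merged: List[List[int]] = []
    current: List[int] = []
    for i, seg in enumerate(segments):
        current.extend(seg)
        if i in cut_after or i == len(segments) - 1:
            merged.append(current)
            current = []
    return merged


# Four toy "acoustic-sentences": the first two share tokens {0, 1}, the last two {2, 3}.
segments = [[0, 0, 1], [0, 1, 0], [2, 3, 3], [3, 2, 2]]
scores = pmi_scores(segments, toy_log_prob)
boundaries = select_constant(scores, k=1)
spans = merge_segments(segments, boundaries)
```

In this toy input, the gap between the second and third segments is the only one where the context makes the next segment less likely than it is on its own, so its PMI is negative and both selectors place the single boundary there. Low PMI is treated as evidence of a style change, and everything on either side of the boundary is merged into one span.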