PSST! Prosodic Speech Segmentation with Transformers

Nathan Roll; Calbert Graham; Simon Todd

TL;DR

PSST addresses automatic segmentation of intonation units (IUs) in American English by repurposing a pretrained STT transformer (Whisper) through retokenization of low-frequency tokens. The method finetunes Whisper on IU boundaries and compares it with syntax-aware and lexical baselines, demonstrating strong in-distribution performance ($F1=0.87$, $Acc=0.96$) and reasonable out-of-distribution results on IViE ($F1=0.73$, $Acc=0.93$). The study shows that limited labeled data plus targeted signal reduction can yield state-of-the-art IU segmentation without enterprise-scale compute, and that prosody-syntax interplay underlies boundary decisions. These findings offer a practical baseline for prosodic segmentation and suggest avenues for extending transformer models to more nuanced prosodic phenomena.
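The F1 and accuracy figures quoted above can be illustrated with a minimal sketch. It assumes boundaries are scored as per-word binary labels (1 if an IU boundary follows the word, 0 otherwise); this labeling scheme, the function name `boundary_scores`, and the toy data are illustrative assumptions, not details taken from the paper.

```python
def boundary_scores(gold, pred):
    """Accuracy, precision, recall and F1 over per-word binary boundary labels.

    Assumption: 1 means "an IU boundary follows this word", 0 means no boundary.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example (hypothetical labels): eight words, three gold boundaries.
gold = [0, 0, 1, 0, 1, 0, 0, 1]
pred = [0, 0, 1, 0, 0, 0, 1, 1]
scores = boundary_scores(gold, pred)
```

Because most words are not followed by a boundary, accuracy counts many easy negatives, whereas F1 only rewards correctly placed boundaries. Under this kind of scoring, accuracy will typically sit above F1, as it does in the toy example.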

Abstract

Self-attention mechanisms have enabled transformers to achieve superhuman-level performance on many speech-to-text (STT) tasks, yet the challenge of automatic prosodic segmentation has remained unsolved. In this paper, we finetune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also diminish input signals by applying a series of filters, finding that low-pass filters at 3.2 kHz improve segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
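To make the 3.2 kHz low-pass step concrete, the following is a minimal sketch of one way such a filter could be built. The abstract does not state the filter design, so the windowed-sinc FIR with a Hamming window, the tap count, the 16 kHz sample rate, and the helper names `lowpass_fir`, `apply_fir` and `rms` are all assumptions made for illustration.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc low-pass FIR taps (Hamming window), scaled to unit DC gain."""
    if not 0 < cutoff_hz < sample_rate / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    fc = cutoff_hz / sample_rate  # normalized cutoff, cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def apply_fir(signal, taps):
    """Centered convolution; output has the input's length (zeros assumed outside)."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))


# Synthetic check (assumed 16 kHz audio): a 1 kHz tone should pass,
# a 6 kHz tone should be strongly attenuated by a 3.2 kHz cutoff.
sr = 16000
n = 1600
low_tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 6000 * t / sr) for t in range(n)]
taps = lowpass_fir(3200, sr)
low_rms = rms(apply_fir(low_tone, taps)[200:-200])    # edges trimmed to skip padding effects
high_rms = rms(apply_fir(high_tone, taps)[200:-200])
```

In practice a signal-processing library would be faster than this pure-Python convolution. The sketch is only meant to show what a low-pass filter keeps (energy below the cutoff) and what it discards (energy above it).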

Paper Structure (15 sections, 3 figures, 2 tables)

Introduction
Methods
Data
Models
Evaluation
Training
Inference
Signal Reduction Experiments
Results
Performance
Performance on IViE Corpus
Failure Cases
Filters
Discussion & Conclusion
Acknowledgments

Figures (3)

Figure 1: PSST Architecture
Figure 2: IU Length Distributions
Figure 3: Filter Frequency and Type
