SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Alan Baade; Puyuan Peng; David Harwath

TL;DR

This work introduces a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information and successfully trains SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks.

Abstract

Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed-size convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge, as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
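The rate figures quoted in the abstract can be related with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the codebook size of 4096 is an assumed value chosen to show the upper bound, not a number taken from the paper. It shows that 60bps at 5Hz corresponds to an average of 12 bits per unit, and that each 5Hz unit spans 3,200 samples of 16kHz audio.

```python
import math


def bits_per_unit(bitrate_bps: float, unit_rate_hz: float) -> float:
    """Average number of bits carried by one unit at the given rates."""
    return bitrate_bps / unit_rate_hz


def max_bitrate_bps(unit_rate_hz: float, codebook_size: int) -> float:
    """Upper bound on bitrate: every unit drawn uniformly from the codebook.

    A measured bitrate can be lower if units are not uniformly distributed.
    """
    return unit_rate_hz * math.log2(codebook_size)


# Figures quoted in the abstract: 5Hz units at 60bps, 16,000 samples/second.
avg_bits = bits_per_unit(60.0, 5.0)
samples_per_unit = 16_000 / 5.0

# Hypothetical codebook size (an assumption, not from the paper).
upper_bound = max_bitrate_bps(5.0, 4096)

print(avg_bits, samples_per_unit, upper_bound)
```

The uniform-codebook bound is only one way to define bitrate; the abstract does not state which definition underlies the 60bps figure.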

Paper Structure (29 sections, 5 equations, 9 figures, 7 tables)

Introduction
Related Work
Self-Supervised Encoder Models
Applications of Neural Codecs
Extracting Semantic Units from Raw Data
Learning Self-Supervised, Syllable-Like Representations from Raw Speech
LossPred: Extracting Syllable-like Segmentation from Relations in HuBERT's Loss
SylBoost: Bootstrapping Pseudo-Syllabic Units with Iterative Distillation
Efficient Extraction of Unit Boundaries for SylBoost
SyllableLM: Speech Language Modeling on Coarse Units
Language Model
Token to Speech Decoding
Experiments
Training Datasets
SylBoost Unit Configurations
...and 14 more sections

Figures (9)

Figure 1: Left-Top: The loss prediction matrix $C$, where brighter values indicate higher likelihood placed on the teacher label. A time-aligned transcript is on the bottom, and predicted cluster unit boundaries span vertically as dashed lines. Left-Bottom: A Mel-Spectrogram of the input waveform with an example masked timespan in gray. The losses on tokens at timesteps covered by the solid blue and dotted red spans are mapped to their corresponding rows and columns in $C$ as described in Section \ref{sec:LossPred}. Right: Visual of SylBoost. We train a student to match intermediate teacher features pooled over regions generated by pseudo-syllable boundaries. We use a min-cut algorithm on the feature self-similarity matrix to extract boundaries, and then apply K-Means and agglomerative clustering to obtain discrete units.
Figure 2: Qualitative results on SylBoost controllability for boundary detection. We plot the feature similarity matrix $A$, described in Section \ref{sec:agglomeration}, for HuBERT, Data2Vec2, and SylBoost on Data2Vec2 when trained at different unit rates. The number of cuts $k$ is selected dynamically as described in Section \ref{sec:tokenizer_exp}.
Figure 3: 8.33Hz
Figure 4: 6.25Hz
Figure 5: 5.0Hz
...and 4 more figures
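
The Figure 1 caption describes a student trained to match teacher features pooled over regions defined by pseudo-syllable boundaries. A minimal sketch of that idea follows, under stated assumptions: pooling is taken to be a per-region mean and the matching loss to be mean squared error, neither of which is specified in the caption. The names `pool_by_boundaries` and `student_loss` are hypothetical.

```python
import numpy as np


def pool_by_boundaries(features, boundaries):
    """Replace each frame's feature with the mean over its region.

    features: array of shape (T, D), one feature vector per frame.
    boundaries: increasing frame indices, starting at 0 and ending at T,
        so consecutive pairs delimit the pseudo-syllable regions.
    Mean pooling is an assumption made for this sketch.
    """
    features = np.asarray(features, dtype=float)
    pooled = np.empty_like(features)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        pooled[start:end] = features[start:end].mean(axis=0)
    return pooled


def student_loss(student, teacher, boundaries):
    """Mean squared error between student features and pooled teacher targets.

    The choice of MSE is an assumption made for this sketch.
    """
    target = pool_by_boundaries(teacher, boundaries)
    student = np.asarray(student, dtype=float)
    return float(np.mean((student - target) ** 2))


# Toy example: 4 frames, 2 feature dimensions, two regions of 2 frames each.
teacher = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]])
boundaries = [0, 2, 4]
targets = pool_by_boundaries(teacher, boundaries)
print(targets)
print(student_loss(targets, teacher, boundaries))
```

In this toy setup, a student that outputs the same vector for every frame inside a region incurs zero loss, which is the property that encourages region-level (rather than frame-level) representations.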
