SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Alan Baade, Puyuan Peng, David Harwath

TL;DR

This work introduces a controllable self-supervised technique that merges speech representations into coarser, syllable-like units while preserving semantic information. Using these units, it trains SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks.

Abstract

Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed-size convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge, as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
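
To make the rates quoted in the abstract concrete, the following is a minimal arithmetic sketch. It assumes a 16,000 samples/second waveform (the lower bound stated above) and takes the 5Hz and 60bps figures at face value; the function names are illustrative and not from the paper. How the bitrate is measured is not described in this excerpt, so the bits-per-unit value is only an average implied by dividing the two numbers.

```python
def samples_per_unit(sample_rate_hz: float, unit_rate_hz: float) -> float:
    """Average number of waveform samples covered by one unit."""
    return sample_rate_hz / unit_rate_hz


def bits_per_unit(bitrate_bps: float, unit_rate_hz: float) -> float:
    """Average number of bits carried by one unit at the given bitrate."""
    return bitrate_bps / unit_rate_hz


if __name__ == "__main__":
    # Figures taken from the abstract: 16,000 samples/s, 5 Hz units, 60 bps.
    print(samples_per_unit(16_000, 5.0))  # average samples per unit
    print(bits_per_unit(60.0, 5.0))       # average bits per unit
```

At these settings each unit spans 3,200 waveform samples on average (200 ms of audio) and carries 12 bits on average.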

Paper Structure (29 sections, 5 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Left-Top: The loss prediction matrix $C$, where brighter is higher likelihood placed on the teacher label. A time-aligned transcript is on the bottom, and predicted cluster unit boundaries span vertically as dashed lines. Left-Bottom: A Mel-Spectrogram of the input waveform with an example masked timespan in gray. The losses on tokens at timesteps covered by the solid blue and dotted red spans are mapped to their corresponding rows and columns in $C$ as described in Section \ref{sec:LossPred}. Right: Visual of SylBoost. We train a student to match intermediate teacher features pooled over regions generated by pseudo-syllable-boundaries. We use a min-cut algorithm on the feature self-similarity matrix to extract boundaries, and then apply K-Means and Agglomerative clustering to obtain discrete units.
  • Figure 2: Qualitative results on SylBoost controllability for boundary detection. We plot the feature similarity matrix $A$, described in Section \ref{sec:agglomeration}, for HuBERT, Data2Vec2, and SylBoost on Data2Vec2 when trained at different unit rates. The number of cuts $k$ is selected dynamically as described in Section \ref{sec:tokenizer_exp}.
  • Figure 3: 8.33Hz
  • Figure 4: 6.25Hz
  • Figure 5: 5.0Hz
  • ...and 4 more figures
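
The boundary-extraction step mentioned in the Figure 1 and Figure 2 captions (cutting a feature self-similarity matrix $A$ into $k$ contiguous regions, then pooling features over each region) can be illustrated with a small sketch. This is not the authors' min-cut algorithm, whose exact objective is not given in this excerpt. It swaps in a simple dynamic program that picks $k$ contiguous segments maximizing the summed within-segment similarity divided by segment length. The function names, the cosine similarity, and the scoring rule are all assumptions made for illustration.

```python
import numpy as np


def self_similarity(feats: np.ndarray) -> np.ndarray:
    """Cosine self-similarity matrix A of shape (T, T) for (T, D) features."""
    norms = np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-8)
    normed = feats / norms
    return normed @ normed.T


def segment_boundaries(A: np.ndarray, k: int) -> list[int]:
    """Split T frames into k contiguous segments (assumed stand-in objective).

    Returns k + 1 boundary indices, starting at 0 and ending at T.
    """
    T = A.shape[0]
    # 2D prefix sums so any square block sum of A is an O(1) lookup.
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = A.cumsum(axis=0).cumsum(axis=1)

    def score(i: int, j: int) -> float:
        # Block sum over frames i..j-1, normalized by segment length.
        total = P[j, j] - P[i, j] - P[j, i] + P[i, i]
        return total / (j - i)

    best = np.full((k + 1, T + 1), -np.inf)
    back = np.zeros((k + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, T + 1):
            for i in range(s - 1, j):
                cand = best[s - 1, i] + score(i, j)
                if cand > best[s, j]:
                    best[s, j] = cand
                    back[s, j] = i

    # Walk the back-pointers from the final frame to recover the cuts.
    bounds = [T]
    j = T
    for s in range(k, 0, -1):
        j = int(back[s, j])
        bounds.append(j)
    return bounds[::-1]


def mean_pool(feats: np.ndarray, bounds: list[int]) -> np.ndarray:
    """Average the features inside each segment, giving one vector per unit."""
    return np.stack([feats[a:b].mean(axis=0) for a, b in zip(bounds[:-1], bounds[1:])])


if __name__ == "__main__":
    # Toy input: 3 frames of one "syllable" followed by 4 frames of another.
    feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 4)
    bounds = segment_boundaries(self_similarity(feats), k=2)
    print(bounds, mean_pool(feats, bounds))
```

The pooled vectors are what a clustering step such as K-Means would then quantize into discrete units; that step is omitted here. The triple loop is cubic in the worst case and is meant for readability, not for the efficiency the paper reports.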