ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

TL;DR

ZeroSyl is a simple training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model. It outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks.

Abstract

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
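The pooling and discretization steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `iters` and `seed` parameters, and the choice to normalize vectors before clustering (following the "spherical K-means" wording in the Figure 1 caption) are assumptions, and the frame features are taken as a plain `(frames, dims)` array rather than actual WavLM outputs.

```python
import numpy as np


def mean_pool_segments(features, boundaries):
    """Average the frames between consecutive boundaries.

    Assumes `boundaries` is a strictly increasing list of frame indices.
    """
    inner = [b for b in boundaries if 0 < b < len(features)]
    edges = [0] + inner + [len(features)]
    return np.stack(
        [features[s:e].mean(axis=0) for s, e in zip(edges[:-1], edges[1:])]
    )


def l2_normalize(x, eps=1e-8):
    """Scale vectors to (approximately) unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def spherical_kmeans(x, k, iters=20, seed=0):
    """Hypothetical helper: K-means on unit vectors using cosine similarity."""
    rng = np.random.default_rng(seed)
    x = l2_normalize(x)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centroids.T, axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # leave empty clusters unchanged
                centroids[c] = l2_normalize(members.sum(axis=0))
    return centroids


def tokenize(features, boundaries, centroids):
    """Map each pooled segment to the ID of its most similar centroid."""
    pooled = l2_normalize(mean_pool_segments(features, boundaries))
    return np.argmax(pooled @ centroids.T, axis=1)
```

The resulting integer IDs are what a language model would be trained on. Each utterance yields one ID per discovered segment, which is how syllable-level units shorten the sequence relative to frame-level tokens.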

Paper Structure

This paper contains 14 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: ZeroSyl detects syllabic boundaries using prominence-based peak detection on features from layer 13 of a frozen WavLM. It then mean-pools semantic features from layer 22 within the discovered boundaries and clusters these using spherical $K$-means. A language model is trained on these cluster IDs.
  • Figure 2: Peak detection on the smoothed L2 norms of framewise features gives our predicted syllable boundaries. The dashed lines indicate the ground-truth syllables.
  • Figure 3: Scaling behavior of ZeroSyl compared to speech (SpidR with $K$-means) and text (BPE) systems on the Libri-Light corpus.
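The boundary detection described in the Figure 1 and Figure 2 captions can be sketched as below. This is a hedged illustration under several assumptions: the smoothing is a simple moving average, the `width` and `min_prominence` values are placeholders rather than the paper's settings, peaks of the smoothed curve are treated directly as boundary candidates (the captions do not say whether the curve is transformed first), and `features` stands in for layer-13 WavLM frames. Prominence is computed by hand here to keep the sketch dependency-light; `scipy.signal.find_peaks` offers a `prominence` argument that serves the same purpose.

```python
import numpy as np


def smooth(x, width=5):
    """Moving average with edge padding; `width` is assumed to be odd."""
    pad = width // 2
    padded = np.pad(x, pad, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")


def peak_prominences(x):
    """Return (index, prominence) pairs for the local maxima of x."""
    peaks = []
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] >= x[i + 1]:
            # Walk outward on each side until a higher sample (or the end),
            # tracking the lowest value seen along the way.
            left_min, j = x[i], i - 1
            while j >= 0 and x[j] <= x[i]:
                left_min = min(left_min, x[j])
                j -= 1
            right_min, j = x[i], i + 1
            while j < len(x) and x[j] <= x[i]:
                right_min = min(right_min, x[j])
                j += 1
            # Prominence: height above the higher of the two surrounding minima.
            peaks.append((i, x[i] - max(left_min, right_min)))
    return peaks


def detect_boundaries(features, width=5, min_prominence=0.5):
    """Frame indices where the smoothed L2-norm curve has a prominent peak."""
    norms = np.linalg.norm(features, axis=1)  # one L2 norm per frame
    curve = smooth(norms, width)
    return [i for i, p in peak_prominences(curve) if p >= min_prominence]
```

Filtering by prominence rather than raw height keeps small ripples in the norm curve from producing spurious boundaries, since a peak only counts if it stands out from its surrounding valleys.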