Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli

TL;DR

Sylber advances speech representations by enforcing syllabic structure in a self-supervised framework, addressing the inefficiency of dense, sub-phonemic SSL tokens. Through self-segmentation distillation and a linear-time greedy segmentation algorithm, it yields compact syllabic tokens (≈4–5 syllables per second) that support intelligible resynthesis and effective language modeling at reduced bitrates. The approach generalizes across unseen languages and domains, and exhibits emergent categorical perception in its embedding space, suggesting efficient tokenization grounded in phonology. These findings point to significant potential for scalable, efficient spoken language processing and modeling. $O(n)$ segmentation and categorical embeddings collectively enable faster, more interpretable speech representations with practical impact on ASR, synthesis, and multilingual NLP tasks.

Abstract

Syllables are compositional units of spoken language that efficiently structure human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning (SSL) framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling. Our proposed segmentation method is highly robust and generalizes to out-of-domain data and unseen languages without any tuning. By training token-to-speech generative models, fully intelligible speech can be reconstructed from Sylber tokens with a significantly lower bitrate than baseline SSL tokens. This suggests that our model effectively compresses speech into a compact sequence of tokens with minimal information loss. Lastly, we demonstrate that categorical perception, a linguistic phenomenon in speech perception, emerges naturally in Sylber, making the embedding space more categorical and sparse than previous speech features and thus supporting the high efficiency of our tokenization. Together, we present a novel SSL approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.
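To make the idea of a linear-time greedy segmentation concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the algorithm from the paper: the function name `greedy_segments`, the two thresholds, and the rule of comparing each frame to the running sum of the open segment are all hypothetical choices. The only properties taken from the text above are that segmentation runs in a single pass over the frames and that non-speech frames have near-null activations.

```python
import math
from typing import List, Optional, Sequence, Tuple


def greedy_segments(
    frames: Sequence[Sequence[float]],
    norm_threshold: float = 0.5,  # hypothetical: frames with a smaller norm count as non-speech
    sim_threshold: float = 0.8,   # hypothetical: minimum cosine similarity to stay in a segment
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame spans found in one left-to-right pass."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None     # start index of the currently open segment
    running_sum: List[float] = []   # element-wise sum of frames in the open segment

    for i, frame in enumerate(frames):
        norm = math.sqrt(sum(x * x for x in frame))
        if norm < norm_threshold:
            # Non-speech frame: close the open segment, if there is one.
            if start is not None:
                segments.append((start, i))
                start = None
            continue
        if start is not None:
            # Cosine similarity to the running sum (same direction as the segment mean).
            dot = sum(a * b for a, b in zip(frame, running_sum))
            denom = norm * math.sqrt(sum(x * x for x in running_sum))
            similarity = dot / denom if denom > 0 else 0.0
            if similarity >= sim_threshold:
                running_sum = [a + b for a, b in zip(running_sum, frame)]
                continue
            # Frame differs too much: close the segment and fall through to open a new one.
            segments.append((start, i))
        start = i
        running_sum = list(frame)

    if start is not None:
        segments.append((start, len(frames)))
    return segments


# Toy 2-D "features": silence, two similar frames, two similar frames, silence, one frame.
toy_frames = [[0, 0], [1, 0], [1, 0.1], [0, 1], [0, 1], [0, 0], [1, 1]]
toy_segments = greedy_segments(toy_frames)
```

Each frame is visited once and compared only against the open segment, so the cost grows linearly with the number of frames. Dividing the number of returned segments by the audio duration in seconds gives a tokens-per-second figure comparable in kind to the 4.27 average quoted above.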

Paper Structure

This paper contains 37 sections, 1 equation, 9 figures, 15 tables, 1 algorithm.

Figures (9)

  • Figure 1: Self-segmentation distillation.
  • Figure 2: Frame-wise similarity matrix of raw features measured by dot product. For HuBERT and SDHuBERT, features from the ninth Transformer layer are extracted. Sylber shows extremely salient syllabic structure that is aligned with the ground truth syllable boundaries, with clear null activations in non-speech frames.
  • Figure 3: A. Overview of articulatory interpolation between rhyming words, with interpolation weight $\alpha \in [0,1]$. B. Hypothetical curves of categorical (solid lines) and non-categorical (dashed lines) embeddings. C. Example similarity curves from mel spectrogram (Mel), HuBERT, and Sylber. Sylber consistently shows highly categorical perception, drawing a sharp boundary in the continuum between words.
  • Figure 4: Frame-wise similarity matrix with and without denoising objectives, using a clean signal (left two panels) and a noisy signal (right two panels). The orange waveform depicts the source noise we add to the clean speech signal.
  • Figure 5: Frame-wise similarity matrices of Sylber applied to samples from OOD datasets: Fisher (top), Spanish (middle), and Mandarin (bottom). The dot product is applied to raw features to measure similarity. We can see highly prominent syllabic segments in all OOD cases.
  • ...and 4 more figures
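
The frame-wise similarity matrices described in Figures 2, 4, and 5 are measured by the dot product between raw features. A minimal sketch of that computation in plain Python follows. The helper name `similarity_matrix` and the toy features are hypothetical; real features would come from the model and have far more dimensions.

```python
from typing import List, Sequence


def similarity_matrix(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the dot product between frame i and frame j."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in frames]
        for u in frames
    ]


# Toy features: two frames of one "syllable", one frame of another, one silent frame.
toy_features = [[1, 0], [1, 0], [0, 2], [0, 0]]
toy_matrix = similarity_matrix(toy_features)
for row in toy_matrix:
    print(row)
```

In this toy case, frames from the same segment form a bright square block on the diagonal, and the silent frame yields an all-zero row and column. That is the kind of pattern the captions describe as salient syllabic structure with null activations in non-speech frames.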