
findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Héctor Javier Vázquez Martínez

Abstract

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

Paper Structure

This paper contains 15 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Syllable boundary segmentation across datasets. (a) Boundary F1 by method and dataset for envelope baselines and default published neural configurations (higher is better; Ornat-Swingley excluded due to missing syllable intervals). (b) Sylber component effect: $\Delta$F1 when swapping the segmenter (CosThresh $\rightarrow$ peakdetect) with the cosine-similarity cue held fixed. (c) VG-HuBERT component effect: $\Delta$F1 when swapping the cue/envelope (SSM $\rightarrow$ GreedyCosine) with the peakdetect segmenter held fixed; positive values indicate improvement from the substituted component.