The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

Jian Zhu; Changbing Yang; Farhan Samir; Jahurul Islam

TL;DR

This work tackles cross-linguistic generalization in multilingual speech processing by replacing text-based representations with universal phonemic symbols from the International Phonetic Alphabet (IPA). It introduces IPAPACK, a large phonemic corpus covering more than 115 languages and selectively checked by linguists, and CLAP-IPA, a contrastive phoneme-speech model capable of open-vocabulary keyword spotting (KWS) and zero-shot matching of speech to phonemic sequences. A dedicated forced aligner, IPA-ALIGNER, is obtained by finetuning CLAP-IPA with the Forward-Sum loss, enabling monotonic phoneme-to-audio alignment in unseen languages. The results show strong cross-linguistic generalization for KWS and zero-shot forced alignment, with phoneme-based modeling outperforming text-based baselines in many multilingual settings, and they highlight practical implications for language documentation and low-resource language processing. The work also discusses data governance, ethical considerations, and the trade-off between scaling the number of languages and scaling training hours, pointing to a promising direction for scalable, language-inclusive speech technology.

Abstract

In this project, we demonstrate that phoneme-based models for speech processing can achieve strong cross-linguistic generalizability to unseen languages. We curated IPAPACK, a massively multilingual speech corpus with phonemic transcriptions, encompassing more than 115 languages from diverse language families and selectively checked by linguists. Based on IPAPACK, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zero-shot forced alignment in unseen languages. We further introduced a neural forced aligner, IPA-ALIGNER, by finetuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.
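The open-vocabulary matching described in the abstract can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric contrastive objective over paired speech and phoneme embeddings; the abstract does not spell out the exact loss, so the function names, the temperature value of 0.07, and the toy vectors below are all hypothetical and stand in for the outputs of the real speech and phoneme encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(speech_embs, phoneme_embs):
    """Cosine similarity between every speech and every phoneme-sequence embedding."""
    s = [l2_normalize(v) for v in speech_embs]
    p = [l2_normalize(v) for v in phoneme_embs]
    return [[sum(a * b for a, b in zip(si, pj)) for pj in p] for si in s]

def cross_entropy_diagonal(logits):
    """Mean of -log softmax(row i)[i]: the i-th pair is treated as the positive."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(speech_embs, phoneme_embs, temperature=0.07):
    """Assumed CLIP-style loss: average of speech-to-phoneme and phoneme-to-speech terms."""
    sim = similarity_matrix(speech_embs, phoneme_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def spot_keyword(speech_emb, keyword_embs):
    """Return (index, similarity) of the candidate keyword closest to the speech segment."""
    sims = similarity_matrix([speech_emb], keyword_embs)[0]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Toy 2-D embeddings; real embeddings would come from the trained encoders.
speech = [[1.0, 0.0], [0.0, 1.0]]
phonemes = [[0.9, 0.1], [0.1, 0.9]]
matched_loss = symmetric_contrastive_loss(speech, phonemes)
swapped_loss = symmetric_contrastive_loss(speech, phonemes[::-1])
best_index, best_score = spot_keyword(speech[0], phonemes)
```

Because candidate keywords are compared purely through their phonemic embeddings, any IPA sequence can serve as a query at inference time, which is what makes the matching open-vocabulary in this sketch.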

Paper Structure (42 sections, 5 equations, 3 figures, 9 tables)

Introduction
Backgrounds
Spoken keyword detection and retrieval
Forced alignment
Dataset curation
Phonemic transcriptions
FLEURS
MSWC
DoReCo
Dataset validation
Method
Contrastive learning for KWS
Speech encoder
Phoneme tokenizer
Phoneme encoder
...and 27 more sections

Figures (3)

Figure 1: Illustration of adaptive average-pooling of phoneme representations, $\mathbf{M_p}\mathbf{H_p} = \mathbf{H_p^{\prime}}$.
Figure 2: Illustration of forced alignment in an Evenki utterance. CLAP-IPA exhibits vague monotonic alignment without finetuning (top). After finetuning, IPA-ALIGNER learns salient monotonic alignment between speech and phonemes (bottom).
Figure 3: Correlation of model performance on individual languages with training hours by language. Languages are represented by their ISO 639-3 codes. Although trained on exactly the same data, the phoneme-based model outperforms the text-based model in every single language, suggesting that phoneme-based modeling enables knowledge transfer across languages.
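
The identity in the Figure 1 caption, $\mathbf{M_p}\mathbf{H_p} = \mathbf{H_p^{\prime}}$, expresses adaptive average pooling as a single matrix product. The sketch below builds such a pooling matrix in plain Python. The bin boundaries follow the floor/ceil convention used by common adaptive pooling implementations, which is an assumption here; the caption does not state how the bins are chosen, and the function names are hypothetical.

```python
def adaptive_avg_pool_matrix(in_len, out_len):
    """Build M_p (out_len x in_len): row i averages the inputs that fall in bin i.

    Assumed binning: start = floor(i * in_len / out_len),
                     end   = ceil((i + 1) * in_len / out_len).
    """
    rows = []
    for i in range(out_len):
        start = (i * in_len) // out_len
        end = -((-(i + 1) * in_len) // out_len)  # integer ceiling division
        width = end - start
        rows.append([1.0 / width if start <= j < end else 0.0
                     for j in range(in_len)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# H_p: four phoneme representations of dimension two (toy values).
H_p = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
M_p = adaptive_avg_pool_matrix(in_len=4, out_len=2)
H_p_pooled = matmul(M_p, H_p)  # corresponds to H_p' in the caption
```

Each row of `M_p` sums to one, so every pooled vector is an average of a contiguous run of phoneme representations, and the input length can differ from utterance to utterance while the output length stays fixed.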
