PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model

Davor Lauc

TL;DR

PolyIPA tackles multilingual phoneme-to-grapheme (P2G) conversion for transliteration, onomastics, and information retrieval by reversing G2P and relying on data augmentation. It introduces IPA2vec and similarIPA as two complementary augmentation helpers and extends the training data with underrepresented proper names to improve cross-linguistic adaptation. On a diverse multilingual test set, it achieves a mean CER of 0.055, a character-level BLEU of 0.914, and an exact-match rate of 0.830. With beam search, the top-3 outputs reach a Top-3 WER of 0.026 and lower the CER to 0.026. The work shows strong performance on shallow orthographies and outlines future enhancements in model architecture, data enrichment, and evaluation frameworks for cross-linguistic P2G transliteration.

Abstract

This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7% (to CER: 0.026), demonstrating the model's effectiveness for cross-linguistic applications.
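
The metrics quoted above can be illustrated with a minimal sketch. It assumes that CER is the Levenshtein distance divided by the reference length, and that the "effective" top-3 error rate is the CER of the best of the first three beam candidates. The abstract does not spell out either definition, so both are assumptions here. The function names and the example name are hypothetical and are not taken from the PolyIPA implementation.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def top_k_cer(ref: str, candidates: list[str], k: int = 3) -> float:
    """Assumed 'effective' error rate: best CER among the first k candidates."""
    return min(cer(ref, c) for c in candidates[:k])


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when moving from `before` to `after`."""
    return (before - after) / before


# Hypothetical beam output for one name, ordered by model score.
candidates = ["Shmit", "Schmitt", "Schmidt"]
top1 = top_k_cer("Schmidt", candidates, k=1)
top3 = top_k_cer("Schmidt", candidates, k=3)

# Consistency check on the reported figures: 0.055 -> 0.026.
reported = relative_reduction(0.055, 0.026)
```

Under these assumptions, the reported drop from 0.055 to 0.026 works out to (0.055 - 0.026) / 0.055 ≈ 0.527, which matches the 52.7% reduction stated in the abstract.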

Paper Structure

This paper contains 35 sections and 3 figures.

Figures (3)

  • Figure 1: Distribution of Character Error Rates across all languages
  • Figure 2: Top-1 Performance Comparison (Top 20 languages by sample count)
  • Figure 3: Top-3 Performance Comparison (Top 20 languages by sample count)