AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer

Davor Lauc, Attapol Rutherford, Weerin Wongwarawipatr

TL;DR

This work addresses the challenge of transliterating Thai proper names into Latin script, a task complicated by Thai phonology (tones and vowel length) and diverse personal spelling practices. It presents AyutthayaAlpha, a pair of transformer-based transliterators built on ByT5 and trained on 1.2 million curated Thai-Latin name pairs, upsampled to 2.7 million examples. The approach reports state-of-the-art performance: 82.32% first-token accuracy, 95.24% first-three-token accuracy, and a character error rate of 0.0047, while capturing pronunciation nuances and cultural preferences in romanization. The model has practical implications for cross-lingual information retrieval, identity verification, and international data standardization. The authors discuss bidirectional processing, data augmentation, and broader multi-script extensions as avenues for future work.
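Because ByT5 operates on raw UTF-8 bytes rather than on a learned subword vocabulary, a Thai name reaches the model as a noticeably longer sequence than its character count suggests. The sketch below illustrates this with plain Python and no model library. The helper name `to_byte_ids` and the sample name are illustrative, and the offset of 3 (reserving ids for padding, end-of-sequence, and unknown tokens) is an assumption about the usual ByT5 convention, not something stated in this paper.

```python
from typing import List


def to_byte_ids(text: str, offset: int = 3) -> List[int]:
    """Map text to byte-level ids: one id per UTF-8 byte, shifted by `offset`.

    `offset` is an assumed number of reserved special-token ids.
    """
    return [byte + offset for byte in text.encode("utf-8")]


thai_name = "สมชาย"      # illustrative sample name, 5 Thai code points
latin_name = "Somchai"   # one possible romanization, 7 ASCII characters

thai_ids = to_byte_ids(thai_name)
latin_ids = to_byte_ids(latin_name)

# Each code point in the Thai Unicode block takes 3 bytes in UTF-8,
# while each ASCII letter takes 1 byte.
print(len(thai_name), len(thai_ids))    # 5 15
print(len(latin_name), len(latin_ids))  # 7 7
```

One consequence of this design is that no Thai-specific tokenizer or segmentation step is needed, at the cost of roughly three input positions per Thai character.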

Abstract

This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.
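To make the character error rate figure concrete, here is a minimal sketch of one common way to compute it: the total Levenshtein (edit) distance between predictions and references, divided by the total number of reference characters. The abstract does not spell out the exact normalization the authors use, so treat this definition, the function names, and the sample strings as assumptions for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Corpus-level CER: summed edit distance over summed reference length."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical predictions vs. reference romanizations.
preds = ["somchai", "siriphon"]
refs = ["somchai", "siriporn"]
print(character_error_rate(preds, refs))  # 2 edits over 15 characters
```

Under this definition, a CER of 0.0047 would correspond to fewer than five character edits per thousand reference characters.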

Paper Structure

This paper contains 31 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Sample of the Transliteration Evaluation Dataset
  • Figure 2: Sample of the Training Dataset
  • Figure 3: Example Predictions from AyutthayaAlpha Models