Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches

Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, Surangika Ranathunga

TL;DR

The paper addresses transliteration of Singlish (Romanized Sinhala) into Sinhala script by comparing a rule-based baseline with a Transformer-based Seq2Seq approach obtained by fine-tuning a pre-trained multilingual model (M2M100). It shows that the Transformer-based method handles ad-hoc Romanization patterns and code-mixed input better, though it is slower on CPU than the rule-based system. A 10k parallel dataset, augmented with realistic typing variations, is used for training, and evaluation is carried out on the IndoNLP COLING 2025 test sets, where the Transformer-based method is more accurate (lower WER/CER, higher BLEU) at the cost of speed. The work provides a practical transliteration framework for Sinhala and underscores the value of data-driven approaches for low-resource scripts, along with open-source tooling and directions for integrating larger language models in future work.
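Since the comparison is reported in WER and CER, a minimal sketch of how these two metrics are commonly computed may help. It is not taken from the paper's code base: the function names are illustrative, and CER here counts Unicode code points, which is an assumption (a Sinhala-aware evaluation might count grapheme clusters instead).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length.

    Assumption: characters are Unicode code points, not grapheme clusters.
    """
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Both rates are normalized by the reference length, so lower is better and a value above 1.0 is possible when the hypothesis is much longer than the reference.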

Abstract

For reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is then compared against our second method, where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method could capture many ad-hoc patterns within the Romanized scripts that the rule-based method could not. The code base associated with this paper is available on GitHub: https://github.com/kasunw22/Sinhala-Transliterator/
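To make the idea of a rule-based baseline concrete, here is a minimal sketch of one generic technique, greedy longest-match substitution over a lookup table. This is an assumption for illustration only and not the authors' rule set: the `RULES` table is a hypothetical four-entry toy, and a real system would need far more rules (vowel signs, hal kirima, context handling).

```python
# Hypothetical toy table, NOT the paper's rules: Romanized chunk -> Sinhala letter.
RULES = {
    "ka": "ක",
    "ma": "ම",
    "la": "ල",
    "a": "අ",
}
MAX_KEY_LEN = max(len(key) for key in RULES)


def transliterate(text):
    """Greedy longest-match transliteration; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += size
                break
        else:
            # No rule matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("kamala"))
```

A fixed table like this can only map spellings it has been told about, which is one way to see why ad-hoc Romanization variants are difficult for rule-based systems.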

Paper Structure (15 sections, 7 tables, 1 algorithm)