Romanized to Native Malayalam Script Transliteration Using an Encoder-Decoder Framework

Bajiyo Baiju, Kavya Manohar, Leena G Pillai, Elizabeth Sherly

TL;DR

This work tackles reverse transliteration of romanized Malayalam into native Malayalam script using an attention-based Bi-LSTM encoder-decoder trained on a corpus of about 4.344 million transliteration pairs drawn from Dakshina and Aksharantar. The model is evaluated on the IndoNLP 2025 Shared Task datasets, achieving a CER of $7.4\%$ on Test Set-1 and $22.7\%$ on Test Set-2; the accompanying WER and BLEU scores show the same pattern, with robust performance on general typing patterns and markedly weaker performance on ad hoc typing. The study describes the model architecture, input constraints, and training setup, and discusses limitations such as the lack of a language model and limited dataset diversity for irregular inputs. Overall, the approach shows promising real-time transliteration performance for Malayalam and outlines clear directions for improving generalization to ad hoc typing styles and vowel-omission patterns.

Abstract

In this work, we present the development of a reverse transliteration model that converts romanized Malayalam to native script using an encoder-decoder framework built on an attention-based bidirectional Long Short-Term Memory (Bi-LSTM) architecture. To train the model, we used a curated, combined collection of 4.3 million transliteration pairs derived from two publicly available Indic language transliteration datasets, Dakshina and Aksharantar. We evaluated the model on two test datasets provided by the IndoNLP-2025-Shared-Task, which contain (1) general typing patterns and (2) ad hoc typing patterns, respectively. On Test Set-1, we obtained a character error rate (CER) of 7.4%. On Test Set-2, however, which contains ad hoc typing patterns where most vowel indicators are missing, our model gave a CER of 22.7%.
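For readers unfamiliar with the reported metrics, the following is a minimal sketch of how CER and WER are commonly computed from Levenshtein edit distance. It is an illustration only, not the authors' evaluation script: the function names `edit_distance` and `error_rate` are hypothetical, the corpus-level aggregation (total edits divided by total reference length) is an assumption, and characters are taken to be Unicode code points, so a Malayalam vowel sign counts as one character.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="char"):
    """Corpus-level error rate: total edits / total reference length.

    unit="char" gives CER over Unicode code points (an assumption);
    unit="word" gives WER over whitespace-separated tokens.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(references, hypotheses):
        r = list(ref) if unit == "char" else ref.split()
        h = list(hyp) if unit == "char" else hyp.split()
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("references contain no characters or words")
    return total_edits / total_len


# Illustrative pair: the hypothesis drops one vowel sign from the reference.
refs = ["മലയാളം"]
hyps = ["മലയളം"]
cer = error_rate(refs, hyps, unit="char")
wer = error_rate(refs, hyps, unit="word")
```

In this example a single missing vowel sign yields one character edit out of six code points, while the word-level score counts the whole word as wrong. This is one reason WER tends to be much higher than CER on inputs where vowel indicators are omitted.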

Paper Structure

This paper contains 8 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The distribution of WER, CER and BLEU over Test Set-1.
  • Figure 2: The distribution of WER, CER and BLEU over Test Set-2.