IndoNLP 2025: Shared Task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages

Deshan Sumanathilaka, Isuri Anuradha, Ruvan Weerasinghe, Nicholas Micallef, Julian Hough

TL;DR

This paper presents the IndoNLP 2025 shared task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages, which addresses the real-time conversion of ad-hoc Romanized input into native scripts (Sinhala, Hindi, Malayalam). The training data comprise Dakshina, Aksharantar, and the Sinhala-focused Swa-Bhasha; the test set covers general and ad-hoc typing patterns across five languages, and submissions are evaluated with BLEU, WER, and CER. Results show that deep-learning approaches, especially BERT-based models, outperform rule-based methods, with Team Vectora achieving the strongest Sinhala transliteration (BLEU ~0.91, WER < 0.09, CER ~0.02). Ad-hoc typing remains a significant challenge across languages, which points to the need for robust context-aware disambiguation and language-specific modeling in practical transliteration tools. The work provides benchmarks and insights for advancing inclusive digital access to Indo-Aryan language scripts.

Abstract

This paper gives an overview of the shared task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages. The task focuses on the reverse transliteration of low-resource languages in the Indo-Aryan family to their native scripts. With current keyboard systems, typing these languages in ad-hoc Romanized form and recovering accurate native script is complex and error-prone. The task aims to introduce and evaluate real-time reverse transliterators that convert Romanized Indo-Aryan languages to their native scripts, improving the typing experience for users. Of 11 registered teams, four participated in the final evaluation phase with transliteration models for Sinhala, Hindi and Malayalam. The proposed solutions not only address the problem of ad-hoc transliteration but also improve the usability of low-resource languages in the digital arena.
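
To make the WER and CER metrics named in the TL;DR concrete, the following is a minimal sketch of how the two error rates are commonly computed from Levenshtein edit distance. It is not the organizers' evaluation script: the function names are hypothetical, whitespace tokenization is assumed for WER, and CER here counts Unicode code points rather than grapheme clusters, which may differ from the scoring actually used in the task.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: code-point edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only; not taken from the shared-task test set.
    reference = "मैं घर जा रहा हूँ"
    hypothesis = "मैं घर जा रहा हूं"
    print(f"WER = {wer(reference, hypothesis):.2f}")
    print(f"CER = {cer(reference, hypothesis):.2f}")
```

Lower values are better for both metrics. Because Indic scripts combine base characters with vowel signs and diacritics, a code-point CER can penalize a single visible character more than once; a grapheme-aware variant would segment the strings before calling `edit_distance`.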

Paper Structure (12 sections, 1 figure, 3 tables)