ChakmaNMT: Machine Translation for a Low-Resource and Endangered Language via Transliteration
Aunabil Chakma, Aditya Chakma, Masum Hasan, Soham Khisa, Chumui Tripura, Rifat Shahriyar
TL;DR
This study addresses machine translation for Chakma, an endangered language with scarce data, by releasing a Chakma–Bangla parallel and monolingual dataset plus a trilingual evaluation benchmark. It introduces a transliteration-based script-bridging pipeline that enables cross-script transfer from Bangla-pretrained models and large language models, and it benchmarks from-scratch MT, fine-tuned pretrained models, and in-context learning (ICL). Three findings stand out: transliteration is essential; ICL can outperform fine-tuning in data-scarce settings, although the outcome depends on translation direction; and bilingual fine-tuning generally outperforms multilingual training in this context. The work provides strong baselines and practical resources to advance Chakma NLP and language preservation, and it highlights transliteration as a key enabler for applying high-resource transfer to extremely low-resource languages.
Abstract
We present the first systematic study of machine translation for Chakma, an endangered and extremely low-resource Indo-Aryan language, with the goal of supporting language access and preservation. We introduce a new Chakma-Bangla parallel and monolingual dataset, along with a trilingual Chakma-Bangla-English benchmark for evaluation. To address script mismatch and data scarcity, we propose a character-level transliteration framework that exploits the close orthographic and phonological relationship between Chakma and Bangla, preserving semantic content while enabling effective transfer from Bangla and multilingual pretrained models. We benchmark from-scratch MT, fine-tuned pretrained models, and large language models via in-context learning. Results show that transliteration is essential and that fine-tuning and in-context learning substantially outperform from-scratch baselines, with strong asymmetry across translation directions.
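
To make the character-level transliteration idea concrete, the following is a minimal sketch of a code-point-to-code-point mapping between the Chakma and Bangla Unicode blocks. The mapping table is a small illustrative subset chosen for this example and is an assumption, not the mapping used in the paper; the function names are likewise hypothetical. Chakma consonants carry a different inherent vowel from Bangla consonants, so a real system would need additional rules beyond a one-to-one table.

```python
# Illustrative subset only: NOT the authors' mapping table.
# Keys are Chakma code points (block U+11100-U+1114F),
# values are Bangla code points (block U+0980-U+09FF).
CHAKMA_TO_BANGLA = {
    "\U00011100": "\u0981",  # candrabindu
    "\U00011101": "\u0982",  # anusvara
    "\U00011102": "\u0983",  # visarga
    "\U00011107": "\u0995",  # Chakma KAA  -> Bangla KA
    "\U00011108": "\u0996",  # Chakma KHAA -> Bangla KHA
    "\U00011109": "\u0997",  # Chakma GAA  -> Bangla GA
    "\U0001110A": "\u0998",  # Chakma GHAA -> Bangla GHA
    "\U0001110B": "\u0999",  # Chakma NGAA -> Bangla NGA
}

# The reverse table is only well defined because the forward table is one-to-one.
BANGLA_TO_CHAKMA = {bn: ck for ck, bn in CHAKMA_TO_BANGLA.items()}


def transliterate(text: str, table: dict) -> str:
    """Replace each character found in `table`; leave everything else unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def chakma_to_bangla(text: str) -> str:
    """Map Chakma-script text into Bangla script (hypothetical helper)."""
    return transliterate(text, CHAKMA_TO_BANGLA)


def bangla_to_chakma(text: str) -> str:
    """Map Bangla-script text back into Chakma script (hypothetical helper)."""
    return transliterate(text, BANGLA_TO_CHAKMA)
```

In a pipeline of the kind the abstract describes, Chakma text would be mapped into Bangla script before being passed to a Bangla-pretrained model, and model output in Bangla script would be mapped back when Chakma is the target language. Characters outside the table, such as spaces, digits, and punctuation, pass through unchanged.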
