ChakmaNMT: Machine Translation for a Low-Resource and Endangered Language via Transliteration

Aunabil Chakma, Aditya Chakma, Masum Hasan, Soham Khisa, Chumui Tripura, Rifat Shahriyar

TL;DR

This study addresses machine translation for Chakma, an endangered language with scarce data, by releasing a Chakma–Bangla parallel and monolingual dataset plus a trilingual evaluation benchmark. It introduces a transliteration-based script-bridging pipeline that enables cross-script transfer from Bangla-pretrained models and large language models, and benchmarks from-scratch MT, fine-tuned pretrained models, and in-context learning (ICL). Key findings: transliteration is essential; ICL can outperform fine-tuning in data-scarce settings, although results differ by translation direction; and bilingual fine-tuning generally outperforms multilingual training in this context. The work provides strong baselines and practical resources to advance Chakma NLP and language preservation, highlighting transliteration as a crucial enabler for applying high-resource transfer to extremely low-resource languages.

Abstract

We present the first systematic study of machine translation for Chakma, an endangered and extremely low-resource Indo-Aryan language, with the goal of supporting language access and preservation. We introduce a new Chakma-Bangla parallel and monolingual dataset, along with a trilingual Chakma-Bangla-English benchmark for evaluation. To address script mismatch and data scarcity, we propose a character-level transliteration framework that exploits the close orthographic and phonological relationship between Chakma and Bangla, preserving semantic content while enabling effective transfer from Bangla and multilingual pretrained models. We benchmark from-scratch MT, fine-tuned pretrained models, and large language models via in-context learning. Results show that transliteration is essential and that fine-tuning and in-context learning substantially outperform from-scratch baselines, with strong asymmetry across translation directions.
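The character-level transliteration idea can be illustrated with a minimal Python sketch. The three character pairs below are placeholders chosen for illustration and are assumptions, not the mapping table used in the paper; `CCP_TO_BN`, `transliterate`, and `invert` are hypothetical names.

```python
# Illustrative subset only: these pairs are assumptions for the sketch,
# not the paper's actual Chakma-Bangla mapping table.
CCP_TO_BN = {
    "\U00011107": "\u0995",  # CHAKMA LETTER KAA  -> BENGALI LETTER KA
    "\U00011108": "\u0996",  # CHAKMA LETTER KHAA -> BENGALI LETTER KHA
    "\U00011109": "\u0997",  # CHAKMA LETTER GAA  -> BENGALI LETTER GA
}


def transliterate(text, table=CCP_TO_BN):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def invert(table):
    """Build the reverse table; this is lossless only for one-to-one entries."""
    return {target: source for source, target in table.items() if target}
```

A pipeline along these lines would transliterate Chakma text into Bangla script before passing it to a Bangla-pretrained model, and apply the inverted table to model output when Chakma is the target language. Entries that merge several Chakma characters into one Bangla character cannot be recovered by the inverted table.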

Paper Structure

This paper contains 56 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Illustrative Chakma$\rightarrow$Bangla translation example comparing our two best-performing approaches: fine-tuned NMT (BanglaT5) and in-context learning (GPT with 400 randomly selected examples). Despite similar automatic scores, the outputs differ in lexical choice and interpretation.
  • Figure 2: Distribution of Chakma monolingual data by content type. The corpus contains 42,783 monolingual samples collected from diverse sources, including dictionaries, stories, textbooks, poems, novels, and articles.
  • Figure 3: Nearest-character substitutions used in the Chakma–Bangla transliteration system for characters without direct one-to-one mappings. These substitutions preserve semantic content and approximate pronunciation, while potentially neutralizing non-contrastive orthographic distinctions. The only entry marked with a dash (–) in the Bangla column corresponds to a rare Chakma prosodic lengthening marker that lacks an explicit Bangla graphemic equivalent and is normalized during transliteration. All four Chakma characters without direct Bangla counterparts are extremely rare in contemporary usage and have negligible impact on downstream translation quality.
  • Figure 4: Prompt format used for our few-shot in-context learning (ICL) experiments, illustrating the structure of source-target examples, task instructions, and test-time input.
  • Figure 5: Qualitative examples of BN$\rightarrow$CCP translations generated by different models on short and long sentences, illustrating differences in semantic adequacy and robustness across approaches. Parentheses () provide a literal English gloss of each model output for readability, while square brackets [] give a brief qualitative analysis comparing translation quality. Overall, BanglaT5 and GPT models using in-context learning (ICL) produce more semantically faithful translations than earlier SMT, RNN, and baseline Transformer approaches, particularly for longer sentences.
  • ...and 1 more figure
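
The Figure 4 caption describes a few-shot prompt made of task instructions, source-target examples, and the test-time input. A rough Python sketch of how such a prompt might be assembled follows. The instruction wording, the line layout, and the name `build_prompt` are assumptions for illustration; the exact format used in the paper is not reproduced here.

```python
import random


def build_prompt(pool, source, k=400, src_name="Chakma", tgt_name="Bangla", seed=0):
    """Assemble a few-shot prompt from k randomly sampled (source, target) pairs.

    The layout below is a hypothetical format, not the one shown in Figure 4.
    """
    rng = random.Random(seed)                      # fixed seed for reproducibility
    shots = rng.sample(pool, min(k, len(pool)))    # sample without replacement
    lines = [f"Translate from {src_name} to {tgt_name}."]
    for src, tgt in shots:
        lines.append(f"{src_name}: {src}")
        lines.append(f"{tgt_name}: {tgt}")
    lines.append(f"{src_name}: {source}")          # test-time input
    lines.append(f"{tgt_name}:")                   # left open for the model
    return "\n".join(lines)
```

The default `k=400` mirrors the "400 examples" setting mentioned in the Figure 1 caption; swapping `src_name` and `tgt_name` gives the opposite translation direction.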