Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir

TL;DR

This work tackles the challenge of modeling a low-resource language, Chakma, by creating a novel Bangla-transliterated Chakma corpus derived from printed literature and validated by native speakers. It demonstrates that Masked Language Model fine-tuning of six encoder models on this transliterated data yields substantial improvements over pre-trained baselines, with peak token accuracy of 73.54% and perplexity as low as 2.899, and highlights how data quality and OCR fidelity critically influence results. The study shows that transliteration-based transfer learning can effectively bridge resource gaps for Chakma, especially when leveraging multilingual pretraining and careful data curation, and it releases the dataset to spur further research. The findings emphasize the practical importance of high-quality OCR and linguistic fidelity in low-resource multilingual NLP and point to future work on transliteration strategies and broader data sources.

Abstract

As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature and validated by native speakers. Using this dataset, we fine-tune six encoder-based transformer models, including multilingual (mBERT, XLM-RoBERTa, DistilBERT), regional (BanglaBERT, IndicBERT), and monolingual English (DeBERTaV3) variants, on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning to the Chakma language, and we release our dataset to encourage further research on multilingual language modeling for low-resource languages.
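The two metrics quoted in the abstract, token accuracy and perplexity, are conventionally computed over masked positions only. The sketch below illustrates those standard definitions in plain Python; it is not the authors' evaluation code. The function name `mlm_metrics` and the `ignore_index=-100` convention for unmasked positions are assumptions made for illustration.

```python
import math


def mlm_metrics(logits, labels, ignore_index=-100):
    """Return (token_accuracy, perplexity) over masked positions only.

    logits: list of per-position score lists (one score per vocabulary id).
    labels: list of gold token ids; positions equal to ignore_index were
            not masked and are skipped (an assumed convention).
    """
    correct = 0
    total = 0
    nll_sum = 0.0
    for scores, gold in zip(logits, labels):
        if gold == ignore_index:
            continue
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-likelihood of the gold token at this position.
        nll_sum += log_z - scores[gold]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == gold)
        total += 1
    if total == 0:
        raise ValueError("no masked positions to score")
    # Perplexity is the exponential of the mean negative log-likelihood.
    return correct / total, math.exp(nll_sum / total)
```

Under this definition, a perplexity of 2.90 corresponds to a mean cross-entropy of ln(2.90) ≈ 1.06 nats per masked token, and a model that spreads probability uniformly over a vocabulary of size V has perplexity V.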

Paper Structure

This paper contains 18 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overall workflow of OCR-based data curation, manual correction, and MLM fine-tuning for Bangla-transliterated Chakma language model
  • Figure 2: Sample data illustrating quality comparison across different methods, highlighting missing sentences in Gemini and spelling errors in other models caused by the misinterpretation of conjunct characters, phonetic signs, vowel diacritics, consonant modifiers, nasalization, and related orthographic features.
  • Figure 3: Comparison of universal multilingual and regional encoder models. Each grouped bar chart shows the accuracy of pre-trained language models fine-tuned on manually corrected data, categorized by parameter size.
  • Figure 4: Details of Chakma Storybooks Used in the Dataset