Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs

Tamzeed Mahfuz, Satak Kumar Dey, Ruwad Naswan, Hasnaen Adil, Khondker Salman Sayeed, Haz Sameen Shahgir

TL;DR

This study probes the necessity and viability of Bengali-oriented LLMs by benchmarking open-weight and closed-source LLMs against fine-tuned encoder–decoder baselines across Bengali NLU and NLG tasks. It finds that English-centric LLMs excel in reasoning and understanding but struggle with Bengali-script generation due to tokenization inefficiencies and biased machine-translated datasets, while Bengali-specific models still face data and resource constraints. The work highlights a strong need for Bengali-focused pretraining and instruction-tuning data, but argues that immediate benefits can be gained by leveraging high-quality translation models with powerful English LLMs, alongside efforts to improve Bengali tokenization and corpus quality. The paper emphasizes careful dataset creation and evaluation, and calls for a phased approach that combines translation-based workflows with targeted Bengali data gathering to advance practical Bengali NLP in the near term. Overall, it underscores the trade-offs between model scale, data quality, and linguistic fit in advancing Bengali language technology.

Abstract

Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia. We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.

Paper Structure (49 sections, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Large Language Model Training Pipeline and Resource Comparison between BanglaT5 and LLaMA-3. The first step towards capable Bengali LLMs is collecting a large pretraining corpus. However, the iterative nature of LLM development makes it unlikely that sufficient pretraining data and compute alone would enable Bengali LLMs to match the capabilities of their English-oriented counterparts.
  • Figure 2: Qualitative Examples of Inefficient Bengali Script Tokenization.
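
One contributing factor to the tokenization inefficiency named in Figure 2 can be illustrated without any tokenizer at all. The sketch below is not the paper's measurement. It only compares the UTF-8 byte length of an English word and a Bengali word, on the assumption that the tokenizer in question operates on UTF-8 bytes (as byte-level BPE tokenizers do). The helper name `utf8_bytes_per_char` and the sample words are illustrative choices, and the real token count for any model depends on its learned vocabulary and merges.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample words (assumption: any Latin-script vs. Bengali-script
# pair would show the same pattern, since the Bengali Unicode block
# U+0980..U+09FF always encodes to 3 bytes per code point in UTF-8).
english = "language"
bengali = "ভাষা"  # "language" in Bengali, 4 code points

english_ratio = utf8_bytes_per_char(english)
bengali_ratio = utf8_bytes_per_char(bengali)

print(f"English: {english_ratio:.1f} bytes per character")
print(f"Bengali: {bengali_ratio:.1f} bytes per character")
```

A byte-level tokenizer with few Bengali merges in its vocabulary starts from three bytes per Bengali code point, so the same content tends to cost more tokens than its English equivalent. That is consistent with the increased computational cost described in the abstract, though the exact overhead varies by model.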