BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation

Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan

TL;DR

BanglaSTEM addresses the translation barrier for STEM problems posed in Bangla by building a high-quality Bangla-English parallel corpus and validating its practical value on downstream tasks. Several language models generate candidate translations, human evaluators curate them, and a composite score $Q = 0.4 \cdot \mathrm{TA} + 0.4 \cdot \mathrm{TT} + 0.2 \cdot \mathrm{LN}$ is used to select 5,000 high-quality pairs; BanglaT5 is then fine-tuned on this data. On code generation and mathematical problem solving, the fine-tuned translator reaches 82.5% and 79% accuracy, respectively, markedly outperforming direct Bangla input, Google Translate, and the base BanglaT5 model. By releasing both the corpus and the tuned model, the work lets Bangla speakers use English-centric reasoning tools for technical tasks more effectively.
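As a rough illustration of how the composite score could drive selection, the sketch below computes $Q$ from the three component scores and keeps the top-ranked candidates. Only the weights (0.4, 0.4, 0.2) come from the formula above; the `Candidate` record, the 1-5 rating scale, and the simple top-k rule are assumptions made for the example and may differ from the authors' actual procedure.

```python
from dataclasses import dataclass
from typing import List

# Weights from the composite score Q = 0.4*TA + 0.4*TT + 0.2*LN.
W_TA, W_TT, W_LN = 0.4, 0.4, 0.2


@dataclass
class Candidate:
    """One translation candidate (hypothetical record layout)."""
    bangla: str
    english: str
    ta: float  # TA component score (a 1-5 scale is assumed here)
    tt: float  # TT component score
    ln: float  # LN component score


def composite_score(c: Candidate) -> float:
    """Weighted sum of the three component scores."""
    return W_TA * c.ta + W_TT * c.tt + W_LN * c.ln


def select_top(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Keep the k candidates with the highest composite score (assumed rule)."""
    return sorted(candidates, key=composite_score, reverse=True)[:k]
```

With this layout, `select_top(candidates, 5000)` would reduce a candidate pool to 5,000 pairs. Because TA and TT carry 0.4 each, a candidate that is weak on either of them cannot be rescued by a high LN score alone.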

Abstract

Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and solving math problems. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
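The translate-then-solve idea described in the abstract can be sketched as a two-stage function with a simple accuracy measure on top. This is a minimal illustration, not the authors' evaluation code: the `translate` and `solve` callables, the exact-match scoring, and the toy stand-ins at the bottom are all hypothetical. In practice `translate` would wrap a Bangla-English translation model and `solve` an English-centric language model.

```python
from typing import Callable, Iterable, Tuple

Translator = Callable[[str], str]  # Bangla text -> English text
Solver = Callable[[str], str]      # English question -> answer


def answer_bangla_question(question_bn: str,
                           translate: Translator,
                           solve: Solver) -> str:
    """Translate a Bangla question to English, then hand it to the solver."""
    return solve(translate(question_bn))


def accuracy(items: Iterable[Tuple[str, str]],
             translate: Translator,
             solve: Solver) -> float:
    """Fraction of (question, expected_answer) pairs answered exactly right."""
    pairs = list(items)
    if not pairs:
        return 0.0
    correct = sum(
        answer_bangla_question(q, translate, solve) == expected
        for q, expected in pairs
    )
    return correct / len(pairs)


# Toy stand-ins so the sketch runs without any model (both hypothetical).
def toy_translate(text: str) -> str:
    table = {"bn-q1": "What is 2 + 3?", "bn-q2": "What is 4 * 5?"}
    return table.get(text, text)


def toy_solve(question_en: str) -> str:
    table = {"What is 2 + 3?": "5", "What is 4 * 5?": "20"}
    return table.get(question_en, "unknown")
```

Keeping the translator and the solver as separate arguments makes it straightforward to swap translators while holding the solver fixed, which is the kind of comparison the downstream evaluation relies on.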

Paper Structure

This paper contains 18 sections, 1 equation, 1 figure, 6 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of the BanglaSTEM pipeline. The dataset construction phase (left) generates 12,711 translation candidates using three LLMs; these undergo human curation and quality-based selection to produce 5,000 high-quality parallel sentence pairs. The model training and evaluation phase (right) fine-tunes BanglaT5 on this dataset and evaluates it on code generation and mathematical problem-solving tasks, achieving 82.5% and 79.0% accuracy, respectively.