BDA: Bangla Text Data Augmentation Framework
Md. Tariquzzaman, Audwit Nafi Anam, Naimul Haque, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan
TL;DR
The paper tackles data scarcity in Bangla NLP by proposing BDA, a hybrid text data augmentation framework that combines rule-based and transformer-based methods with a semantic and lexical filtering pipeline. BDA applies four augmentation techniques (Synonym Replacement, Random Swap, Back-Translation, and Paraphrasing) followed by a two-stage filter, producing high-quality synthetic samples while preserving labels. Across five Bangla classification datasets, models trained on only half of the data plus BDA augmentation reach F1 scores comparable to models trained on the full dataset, and ablations show that each method contributes to the gains. The work highlights that hybrid augmentation with careful filtering is most beneficial in data-constrained settings; the benefits taper off when training data is abundant or when the Bangla text is noisy.
Abstract
Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework's effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA's performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.
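To make the augment-then-filter idea concrete, the following is a minimal sketch of one rule-based augmentation (Random Swap) followed by a two-stage filter. It is not the authors' implementation. The function names, the thresholds `min_semantic` and `max_lexical`, the use of `difflib.SequenceMatcher` as the lexical measure, and the `semantic_sim` callable (a stand-in for whatever sentence-similarity model is used) are all assumptions introduced here for illustration.

```python
import random
from difflib import SequenceMatcher


def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of `tokens` with `n_swaps` random pairs of positions swapped."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their tokens.
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def lexical_similarity(original, candidate):
    """Order-sensitive token similarity in [0, 1]; 1.0 means identical sequences."""
    return SequenceMatcher(None, original, candidate).ratio()


def keep_candidate(original, candidate, semantic_sim,
                   min_semantic=0.8, max_lexical=0.9):
    """Two-stage filter (hypothetical thresholds).

    Stage 1 keeps only candidates whose meaning stays close to the original.
    Stage 2 keeps only candidates that differ enough on the surface.
    """
    if semantic_sim(original, candidate) < min_semantic:
        return False
    return lexical_similarity(original, candidate) <= max_lexical


if __name__ == "__main__":
    sentence = "আমি আজ একটি বই পড়ছি".split()
    augmented = random_swap(sentence, n_swaps=1, rng=random.Random(0))

    # Placeholder similarity: a real pipeline would compare sentence embeddings.
    def always_similar(original, candidate):
        return 1.0

    print(augmented, keep_candidate(sentence, augmented, always_similar))
```

The lexical measure here is deliberately order-sensitive: a set-based overlap such as Jaccard similarity would score every Random Swap output as identical to its source, so the second stage would reject all of them.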
