Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

Md Nishat Raihan; Dhiman Goswami; Antara Mahmud

TL;DR

This paper introduces Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models such as mBERT and XLM-R.

Abstract

Text classification is one of the most popular downstream tasks in Natural Language Processing, and it becomes more daunting when the texts are code-mixed. Although they are not exposed to such text during pre-training, different BERT models have demonstrated success in tackling code-mixed NLP challenges. To further enhance their performance, code-mixed NLP models have depended on combining synthetic data with real-world data. It is therefore crucial to understand how the performance of BERT models is affected when they are pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models such as mBERT and XLM-R. Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding, contributing to advancements in the field.

Paper Structure (10 sections, 5 figures, 5 tables)

Introduction
Background and Related Works
Background on Important or Non-Standard Concepts
Related work
Proposed Approach
Experiments
Datasets
Model Pre-train
Results and Analysis
Conclusion

Figures (5)

Figure 1: Workflow of the pre-trained models
Figure 2: Training Loss: Tri-Distil-BERT (left), Mixed-Distil-BERT (right)
Figure 3: 3 Language Code-Mixed Emotion Detection: Weighted F1-Score Comparison
Figure 4: 3 Language Code-Mixed Sentiment Analysis: Weighted F1-Score Comparison
Figure 5: 3 Language Code-Mixed Offensive Language Identification: Weighted F1-Score Comparison
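
Figures 3 to 5 compare models by weighted F1-score. The evaluation code is not part of this page, so the following is only a minimal sketch of the standard definition (per-class F1 averaged with weights proportional to each class's support), not the authors' implementation. The function name `weighted_f1` and the example labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # Precision is defined as 0 when the class is never predicted.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += (n / total) * f1
    return score


# Hypothetical three-class sentiment labels, for illustration only.
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(weighted_f1(gold, pred))
```

Because each class is weighted by its support, frequent classes dominate the score, which matters when comparing models on imbalanced code-mixed datasets.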
