Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

Md Nishat Raihan, Dhiman Goswami, Antara Mahmud

TL;DR

This paper introduces Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and perform competitively with larger models such as mBERT and XLM-R.

Abstract

One of the most popular downstream tasks in the field of Natural Language Processing is text classification. Text classification becomes more daunting when the texts are code-mixed. Although they are not exposed to such text during pre-training, different BERT models have demonstrated success in tackling code-mixed NLP challenges. Moreover, to enhance their performance, code-mixed NLP models have depended on combining synthetic data with real-world data. It is crucial to understand how the performance of BERT models is affected when they are pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models like mBERT and XLM-R. Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding, contributing to advancements in the field.

Paper Structure (10 sections, 5 figures, 5 tables)

This paper contains 10 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Workflow of the pre-trained models
  • Figure 2: Training Loss: Tri-Distil-BERT (left), Mixed-Distil-BERT (right)
  • Figure 3: 3 Language Code-Mixed Emotion Detection: Weighted F1-Score Comparison
  • Figure 4: 3 Language Code-Mixed Sentiment Analysis: Weighted F1-Score Comparison
  • Figure 5: 3 Language Code-Mixed Offensive Language Identification: Weighted F1-Score Comparison