Enhancing Multilingual Language Models for Code-Switched Input Data

Katherine Xie, Nitya Babbar, Vicky Chen, Yoanna Turura

TL;DR

The paper investigates enhancing multilingual language understanding by pre-training mBERT on code-switched Spanglish data and evaluating on POS tagging, language identification, NER, and sentiment analysis. Using LinCE benchmarks and a combination of token- and sentence-level tasks, the study shows the pre-trained model generally matches or surpasses a baseline, with the largest gains in POS tagging. A latent-space analysis reveals more homogeneous English and Spanish embeddings after pre-training, offering interpretability into how code-switching is captured. The work underscores the potential of adapting multilingual models to code-switched input for broader global and multilingual applicability, while outlining future directions across language pairs and multimodal data, and highlighting ethical considerations for deployment.

Abstract

Code-switching, or alternating between languages within a single conversation, presents challenges for multilingual language models on NLP tasks. This research investigates whether pre-training Multilingual BERT (mBERT) on code-switched datasets improves the model's performance on critical NLP tasks such as part-of-speech tagging, sentiment analysis, named entity recognition, and language identification. We use a dataset of Spanglish tweets for pre-training and evaluate the pre-trained model against a baseline model. Our findings show that our pre-trained mBERT model outperforms or matches the baseline model on the given tasks, with the most significant improvements seen for part-of-speech tagging. Additionally, our latent analysis uncovers more homogeneous English and Spanish embeddings for language identification tasks, providing insights for future modeling work. This research highlights the potential of adapting multilingual LMs to code-switched input data for greater utility in globalized and multilingual contexts. Future work includes extending experiments to other language pairs, incorporating multiform data, and exploring methods for better understanding context-dependent code-switches.

Paper Structure

This paper contains 16 sections, 5 figures.

Figures (5)

  • Figure 1: Pipeline for evaluating the impact of fine-tuning mBERT on code-switched data
  • Figure 2: Validation Results on NLP Tasks Using LinCE Benchmark
  • Figure 3: Pre-trained vs. Baseline mBERT Model Validation Accuracy
  • Figure 4: Pre-trained vs. Baseline mBERT Validation Loss on POS Task
  • Figure 5: UMAP visualization of English and Spanish word embeddings before and after fine-tuning on code-switched data.