Multilingual Large Language Models Are Not (Yet) Code-Switchers

Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Indra Winata, Alham Fikri Aji

TL;DR

This study analyzes how well multilingual LLMs handle code-switching (CSW) by evaluating four CSW tasks (sentiment analysis, machine translation, summarization, and word-level language identification, or LID) under zero-shot, few-shot, and fine-tuning regimes. It finds that while prompting and scaling yield some gains, fine-tuned smaller models consistently outperform the largest multilingual LLMs, with ChatGPT showing competitive performance but limited transparency. The authors argue that current multilingual LLMs do not inherently master code-switching, and they propose data-centric and objective-driven directions (e.g., CSW-focused data representation, token-level objectives) to bridge this gap. The work emphasizes the need for inclusive language technologies that reflect real-world code-switching and offers practical guidance for future model development and evaluation. Overall, the paper provides a task-diverse benchmark and discusses its implications for building NLP systems that can handle code-switched text.

Abstract

Multilingual Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks, exhibiting state-of-the-art performance through zero-shot or few-shot prompting methods. While there have been extensive studies on their abilities in monolingual tasks, the investigation of their potential in the context of code-switching (CSW), the practice of alternating languages within an utterance, remains relatively uncharted. In this paper, we provide a comprehensive empirical analysis of various multilingual LLMs, benchmarking their performance across four tasks: sentiment analysis, machine translation, summarization and word-level language identification. Our results indicate that despite multilingual LLMs exhibiting promising outcomes in certain tasks using zero or few-shot prompting, they still underperform in comparison to fine-tuned models of much smaller scales. We argue that current "multilingualism" in LLMs does not inherently imply proficiency with code-switching texts, calling for future research to bridge this discrepancy.

Paper Structure (34 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Illustration of tasks included in our benchmark study.
  • Figure 2: Evaluation results of fine-tuning and prompting LLMs of different scales on various CSW tasks. (top left) F1-score on the sentiment analysis task, (top right) BLEU score on the machine translation task, (bottom left) ROUGE-L on the summarization task, and (bottom right) F1-score on the word-level language identification task. (FT) means results are from fine-tuned models.
  • Figure 3: Performance comparison on (top) Hindi$\rightarrow$English vs Hinglish$\rightarrow$English translation and (bottom) Hinglish$\rightarrow$English vs English$\rightarrow$English summarization.
  • Figure 4: Few-shot evaluation performance for (top left) sentiment analysis task, (top right) machine translation task, (bottom left) summarization task and (bottom right) word-level LID task.
  • Figure 5: LLMs' sentiment analysis evaluation on (left) Sentimix Spanish-English, (center) MixSentiment Malayalam-English, and (right) MixSentiment Tamil-English.
  • ...and 1 more figure