Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh

TL;DR

Code-switching (CSW) NLP remains a major challenge in the era of large language models (LLMs). This survey analyzes 308 studies across 12 NLP tasks, 30+ datasets, and 80+ languages to map how LLMs reshape CSW modeling, from pre-LLM rule-based approaches to modern instruction-tuned and multimodal systems. It organizes the literature into five research areas, classifies advances by architecture, training paradigm, and evaluation methodology, highlights key datasets and benchmarks (e.g., MEGAVERSE, MultiCoNER, CodeMixBench), and identifies persistent gaps in low-resource languages, script diversity, and generation reliability. The paper offers a practical roadmap emphasizing inclusive data collection, fair and CSW-aware evaluation, and linguistically grounded modeling to advance truly multilingual intelligence. Collectively, these insights guide researchers and developers toward robust, equitable CSW NLP suitable for multilingual societies.

Abstract

Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 308 studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.

Paper Structure

This paper contains 98 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Common model failures on code-mixed text. Takeaway: (a) hallucination in machine translation (Bn-Hi-En), (b) factual inconsistency in POS tagging (Sp-En), and (c) misinterpretation in sentiment analysis (Kz-Ru).
  • Figure 2: A taxonomy of the code-switching research landscape. Takeaway: The mind map highlights important works across the diverse categories of code-switching research.
  • Figure 3: Failure cases when prompting ChatGPT with an Odia-Romanized Hindi code-mixed pair.
  • Figure 4: Failure cases when prompting GLM-4.6 with a Bangla-English code-mixed pair.
  • Figure 5: Failure cases when prompting Perplexity with a Konkani-English code-mixed pair.
  • ...and 4 more figures