Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs
Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney
TL;DR
Code-mixing induces language-confounding behavior and safety vulnerabilities in large language models (LLMs). The authors present an operational playbook spanning data, modeling, prompting, evaluation, and safety for building LLMs that can handle code-switching (CSW), with emphasis on explicit, targeted interventions across the model lifecycle. They detail code-mix-aware pre-training, post-training adaptation, and safety-focused practices, and critique current evaluation practices for overreliance on monolingual benchmarks. The work highlights English-centric biases and advocates for diverse linguistic data, revised evaluation paradigms, and mix-aware safety alignment to enable robust real-world multilingual AI systems.
Abstract
Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern LLM settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including the use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
