Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs

Himanshu Gupta; Pratik Jayarao; Chaitanya Dwivedi; Neeraj Varshney

TL;DR

Code-mixing induces language-confounding behavior and safety vulnerabilities in large language models (LLMs). The authors present an operational playbook spanning data, modeling, prompting, evaluation, and safety for building LLMs that can handle code-switching (CSW), with emphasis on explicit, targeted interventions across the model lifecycle. They detail code-mix-aware pre-training, post-training adaptation, and safety-focused practices, and they critique current evaluation practices for over-relying on monolingual benchmarks. The work highlights English-centric biases and advocates diverse linguistic data, revised evaluation paradigms, and mix-aware safety alignment to enable robust real-world multilingual AI systems.

Abstract

Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern LLM settings. We introduce a unifying taxonomy that organizes prior work along the dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including the use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.

Paper Structure (57 sections, 1 figure, 1 table)

Introduction
Background: Linguistic Foundations and Code-mix Applications
Core Definitions
A Typology of Code-Mixing
Bilingual and English-Centric Mixing
Multilingual and Non-English Scenarios
Code-mix Applications
Prompting and In-Context Learning
Code-Mix Prompting to Unlock Multilingual Capabilities
Bridging Representations via In-Context Mixing.
Overcoming Script Barriers via Transliteration.
Activating Cultural Knowledge.
Practitioner's Insight: The Multilingual Unlock
Prompting Strategies for Controlled Code-Mixed Generation
Explicit Definitions over Role-Playing.
...and 42 more sections

Figures (1)

Figure 1: A comprehensive framework for addressing code-mixing in the LLM era, spanning data, modeling, prompting, and evaluation to build more robust and safe multilingual systems.
