Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, Animesh Mukherjee

TL;DR

This study reveals a critical vulnerability: code-mixed inputs erode safety alignment in open LLMs by causing attributional drift away from safety-critical tokens. It introduces Saliency Drift Attribution (SDA) to diagnose how code-mixing shifts token-level saliency, and demonstrates that a lightweight translation-based restoration, which pivots prompts toward an English representation, recovers roughly 80% of the lost safety. The work combines large-scale multilingual evaluations across 10 cultures with real-world social media data and human validation to show that monolingual safety priors do not transfer robustly to code-mixed contexts. It argues for attribution-aware alignment and culturally robust evaluation pipelines, and offers a practical mitigation that preserves usefulness while reducing harmful generations in multilingual deployments.

Abstract

While LLMs appear robustly safety-aligned in English, we uncover a catastrophic, overlooked weakness: attributional collapse under code-mixed perturbations. Our systematic evaluation of open models shows that the linguistic camouflage of code-mixing (blending languages within a single conversation) can cause safety guardrails to fail dramatically. Attack success rates (ASR) spike from a benign 9% in monolingual English to 69% under code-mixed inputs, with rates exceeding 90% in non-Western contexts such as Arabic and Hindi. These effects hold not only on controlled synthetic datasets but also on real-world social media traces, revealing a serious risk for billions of users. To explain why this happens, we introduce saliency drift attribution (SDA), an interpretability framework that shows how, under code-mixing, the model's internal attention drifts away from safety-critical tokens (e.g., "violence" or "corruption"), effectively blinding it to harmful intent. Finally, we propose a lightweight translation-based restoration strategy that recovers roughly 80% of the safety lost to code-mixing, offering a practical path toward more equitable and robust LLM safety.
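As a rough illustration of how the headline numbers relate, the sketch below computes an attack success rate from per-prompt harmfulness judgments and the fraction of lost safety that a mitigation recovers. It is only a sketch under stated assumptions: the function names are hypothetical, the definition of "recovered safety" as the share of the ASR increase that is undone is an assumption rather than the paper's stated formula, and the restored ASR of 0.21 is an illustrative value chosen to be consistent with "roughly 80%", not a reported result.

```python
def attack_success_rate(harmful_flags):
    """Fraction of prompts whose response was judged harmful.

    harmful_flags: iterable of booleans, one per evaluated prompt
    (hypothetical input format; the paper's judging pipeline may differ).
    """
    flags = list(harmful_flags)
    if not flags:
        raise ValueError("need at least one judged response")
    return sum(1 for flag in flags if flag) / len(flags)


def recovered_fraction(asr_baseline, asr_attacked, asr_restored):
    """Share of the ASR increase that a mitigation undoes.

    ASSUMPTION: 'safety lost' is taken to be asr_attacked - asr_baseline,
    and 'recovered' is the part of that gap closed after restoration.
    """
    lost = asr_attacked - asr_baseline
    if lost <= 0:
        raise ValueError("attacked ASR must exceed baseline ASR")
    return (asr_attacked - asr_restored) / lost


# Toy judgments: 2 harmful responses out of 4 prompts.
toy_asr = attack_success_rate([True, False, False, True])

# Abstract's figures: 9% (English) and 69% (code-mixed).
# 0.21 is a HYPOTHETICAL post-restoration ASR used only for illustration.
share = recovered_fraction(asr_baseline=0.09, asr_attacked=0.69, asr_restored=0.21)
print(f"toy ASR = {toy_asr:.2f}, recovered share = {share:.2f}")
```

Under this assumed definition, closing 48 of the 60 percentage points between 9% and 69% corresponds to recovering 80% of the lost safety.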

Paper Structure

This paper contains 34 sections, 10 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Illustrative examples comparing model outputs for monolingual and code-mixed prompts (generated using LLaMA-3.1-8B).
  • Figure 2: Actual vs generated code-mixed examples using the MLF technique.
  • Figure 3: Token-wise attribution alignment between an English prompt (top) and its code-mixed Hindi counterpart (bottom) using sequence attribution scores.
  • Figure 4: Case 1: English inputs yield non-harmful responses, while Bengali and Hindi code-mixed variants trigger harmful outputs. Both word shift graphs show harmful tokens (e.g., destruction, scum) that have higher attribution in English than under code-mixing. Word clouds appear in the figure labeled fig:case_wordclouds_appd.
  • Figure 5: Case 2: Both English and code-mixed inputs produce harmful responses. Both word shift graphs (English and code-mixing) show that harmful tokens retain high attribution, indicating stable but unsafe behavior across modalities. Word clouds appear in the figure labeled fig:case1_wordclouds_appd.
  • ...and 10 more figures