Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
Haneul Yoo, Yongjin Yang, Hwaran Lee
TL;DR
CSRT introduces a code-switching red-teaming framework to evaluate LLM safety and multilingual understanding. It automatically generates code-switched prompts that mix up to ten languages and tests them on ten models. CSRT achieves substantially higher attack rates than English-only baselines and shows that mixing in more languages, and lower-resource languages in particular, increases unsafe responses. The study provides extensive ablations on language count and resource availability, demonstrates the cross-lingual comprehension of modern LLMs, and discusses the limitations of current defenses against multilingual prompts. The framework offers a practical, scalable approach to robust multilingual safety evaluation, and the findings illuminate how the resource availability of languages influences safety alignment in LLMs.
Abstract
As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and comprehensively investigate the safety and multilingual understanding of LLMs. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques. It achieves 46.7% more successful attacks than standard attacks in English while remaining effective in conventional safety domains. We also examine the ability of these LLMs to generate and understand code-switched text. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts from monolingual data. Finally, we conduct detailed ablation studies on code-switching and put forward an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.
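To make the notion of a code-switched query concrete, here is a toy sketch. It is not CSRT's synthesis procedure, which this section does not describe. The sketch assumes phrase-aligned parallel translations of a benign request and uses a hypothetical helper, `make_code_switched`, that takes segment i from the listed languages in round-robin order. The example data and the helper name are illustrative assumptions.

```python
from itertools import cycle

# Toy, phrase-aligned parallel translations of a benign request (illustrative data).
EXAMPLE_SEGMENTS = {
    "en": ["Could you explain", "how to bake", "bread at home?"],
    "es": ["¿Podrías explicar", "cómo hornear", "pan en casa?"],
    "fr": ["Pourriez-vous expliquer", "comment faire cuire", "du pain à la maison ?"],
}


def make_code_switched(segments_by_lang: dict[str, list[str]], langs: list[str]) -> str:
    """Hypothetical helper (not part of CSRT): build a code-switched sentence
    by taking aligned segment i from the languages in `langs`, round-robin."""
    if not langs:
        raise ValueError("need at least one language")
    n = len(segments_by_lang[langs[0]])
    # Every language must supply the same number of aligned segments.
    for lang in langs:
        if len(segments_by_lang[lang]) != n:
            raise ValueError(f"segment count mismatch for {lang!r}")
    # zip stops after n segments; cycle repeats the language order as needed.
    picked = [segments_by_lang[lang][i] for i, lang in zip(range(n), cycle(langs))]
    return " ".join(picked)


if __name__ == "__main__":
    # Prints: Could you explain cómo hornear du pain à la maison ?
    print(make_code_switched(EXAMPLE_SEGMENTS, ["en", "es", "fr"]))
```

Passing `["fr", "en"]` instead yields "Pourriez-vous expliquer how to bake du pain à la maison ?". Round-robin assignment is used here only because it is deterministic. It ignores the syntactic and pragmatic constraints that govern natural code-switching.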
