Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding

Haneul Yoo; Yongjin Yang; Hwaran Lee

TL;DR

CSRT is a code-switching red-teaming framework for evaluating both the safety and the multilingual understanding of LLMs. Using automatically generated code-switching prompts that combine up to ten languages, tested against ten models, CSRT elicits substantially more successful attacks than English-only baselines and shows that mixing more languages, and lower-resource languages in particular, leads to more unsafe responses. The study provides ablations on the number of languages and on resource availability, examines how well modern LLMs comprehend code-switching text, and discusses the limitations of current defenses against multilingual prompts. Together, these findings offer a practical, scalable approach to multilingual safety evaluation and show how language resource availability relates to safety alignment in LLMs.

Abstract

As large language models (LLMs) have advanced rapidly, concerns about their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and comprehensively investigate the safety and multilingual understanding of LLMs. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English while remaining effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts with monolingual data. Finally, we conduct detailed ablation studies exploring code-switching and posit an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.
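The outline below names Attack Success Rate (ASR) and Refusal Rate (RR) as evaluation metrics, but this excerpt does not give their definitions. The following is a minimal sketch under the assumption that ASR is the fraction of responses judged harmful and RR is the fraction of responses in which the model refuses. The `Judgment` schema and all function names are hypothetical, introduced only for illustration. The `relative_increase` helper shows how an "X% more attacks" comparison could be computed if it is read as a relative increase, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Judgment:
    """One judged model response (hypothetical schema, not from the paper)."""
    harmful: bool  # a judge flagged the response as unsafe
    refused: bool  # the model declined to answer


def attack_success_rate(judgments: Sequence[Judgment]) -> float:
    """Assumed definition: fraction of responses judged harmful."""
    if not judgments:
        return 0.0
    return sum(j.harmful for j in judgments) / len(judgments)


def refusal_rate(judgments: Sequence[Judgment]) -> float:
    """Assumed definition: fraction of responses that are refusals."""
    if not judgments:
        return 0.0
    return sum(j.refused for j in judgments) / len(judgments)


def relative_increase(new: float, base: float) -> float:
    """Relative change of `new` over `base`; 0.5 means 50% more."""
    if base == 0:
        raise ValueError("base rate must be non-zero")
    return (new - base) / base


# Illustrative judgments only; these are not results from the paper.
sample = [
    Judgment(harmful=True, refused=False),
    Judgment(harmful=False, refused=True),
    Judgment(harmful=False, refused=False),
    Judgment(harmful=True, refused=False),
]
asr = attack_success_rate(sample)
rr = refusal_rate(sample)
```

Computing both rates over the same set of judged responses keeps them directly comparable across attack conditions such as English-only and code-switching prompts.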

Paper Structure (55 sections, 4 figures, 12 tables)

Introduction
Related Work
Code-Switching
Red-Teaming LLMs
Code-Switching Red-Teaming
Experiments
Experimental Setup
Evaluation Models
Evaluation Metrics
Attack Success Rate (ASR)
Refusal Rate (RR)
Comprehension (Cmp.)
Sample-level Analysis
Attack Baselines
Evaluation Results
...and 40 more sections

Figures (4)

Figure 1: Example of the CSRT query. Responses of OpenAI's gpt-4o across three user prompts delivering the same meaning: in English, in Korean, and in code-switching (ours). The CSRT enables LLM evaluation in terms of both safety and multilingual understanding.
Figure 2:
Figure 3: Ablation experimental results (ASR) with various combinations of input languages to generate code-switching red-teaming queries.
Figure 4: Evaluation results on different sizes of LLMs.
