Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
TL;DR
The paper introduces Multilingual Blending to evaluate LLM safety alignment under mixed-language inputs, addressing the gap that safety research is predominantly English-centric. It analyzes external patterns (Number of Languages, Resource Level) and internal linguistic patterns (Morphology, Language Family) alongside uncertainty analysis to understand bypass mechanisms. Empirical results show substantial safety bypass in mixed-language settings, with bypass rates reaching up to 67.23% on GPT-3.5 and 40.34% on GPT-4o in some configurations, and average rates exceeding 22% across evaluated models, while uncertainty in outputs roughly doubles under mixing. These findings highlight the need for safety alignment approaches that account for multilingual context and cross-language generalization to preserve trustworthy LLM behavior in real-world multilingual use.
Abstract
As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding LLM behaviors and aligning them with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, which leaves their effectiveness in multilingual contexts, especially mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT-3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending in compromising the safeguards of LLMs. Our experimental results show that, without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the harm of malicious queries, leading to dramatically increased bypass rates in LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably with intrinsic linguistic properties: mixtures of languages with different morphology and from diverse families are more prone to evading safety alignment. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in a complex, multilingual context to align with their superior cross-language generalization capabilities.
