The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi

TL;DR

The paper investigates safety challenges of large language models in multilingual contexts and identifies two language-induced curses: responses to prompts in lower-resource languages are more often harmful and less relevant to the prompt. It evaluates two multilingual alignment strategies, xSFT and xRLHF, using LLaMa2-7B and the HH-RLHF dataset, with translations produced by NLLB-1.3B, and further examines multilingual pretraining (ALMA) as a potential remedy. Results show that alignment training improves safety in high-resource languages more than in low-resource languages, suggesting that the root cause lies in pretraining data scarcity; multilingual pretraining partially mitigates the curses. Overall, the work highlights the difficulty of cross-lingual safety alignment and points to pretraining data expansion and multilingual resources as key directions for future research.

Abstract

As the influence of large language models (LLMs) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages, we observe that (1) LLMs tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) LLMs tend to generate more irrelevant responses to malicious prompts in lower-resource languages. To understand where the discrepancy can be attributed, we study the effect of instruction tuning with reinforcement learning from human feedback (RLHF) or supervised finetuning (SFT) on the HH-RLHF dataset. Surprisingly, while training with high-resource languages improves model alignment, training in lower-resource languages yields minimal improvement. This suggests that the bottleneck of cross-lingual alignment is rooted in the pretraining stage. Our findings highlight the challenges in cross-lingual LLM safety, and we hope they inform future research in this direction.

Paper Structure (35 sections, 4 equations, 2 figures, 13 tables)

Figures (2)

  • Figure 1: Given a set of malicious prompts written in a high-resource language such as English, we translate each prompt into low-resource languages (e.g., Hausa). Compared to the high-resource case, we observe two clear outcomes: (1) the response becomes harmful; (2) the response doesn't align with or is unrelated to the original prompt (e.g., it repeats the prompt).
  • Figure 2: Monolingual SFT fails to improve the harmful rate and the following rate on low-resource languages. Each value in the heatmap is the change in harmful rate (top figure) or following rate (bottom figure) after monolingual SFT is applied. Specifically, the red region (in the top figure) represents a large improvement, demonstrating the effectiveness of monolingual SFT on high-resource languages.
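
The per-language quantities behind Figure 2 can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout, the function names `rates` and `rate_change`, the sign convention (after minus before), and the toy labels are all assumptions made for illustration. In the paper the harmful/following judgments come from its own evaluation procedure, which is not reproduced here.

```python
from collections import defaultdict

def rates(records):
    """Compute per-language harmful rate and following rate.

    records: iterable of (language, is_harmful, follows_prompt) tuples,
    one per model response. This layout is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [harmful, following, total]
    for lang, is_harmful, follows_prompt in records:
        c = counts[lang]
        c[0] += bool(is_harmful)
        c[1] += bool(follows_prompt)
        c[2] += 1
    return {
        lang: {"harmful_rate": h / n, "following_rate": f / n}
        for lang, (h, f, n) in counts.items()
    }

def rate_change(before, after):
    """Change in each rate (after minus before) for languages present in both."""
    return {
        lang: {k: after[lang][k] - before[lang][k] for k in before[lang]}
        for lang in before
        if lang in after
    }

# Toy labels, invented purely to exercise the functions.
before_sft = [("en", True, True), ("en", False, True),
              ("ha", True, False), ("ha", True, True)]
after_sft = [("en", False, True), ("en", False, True),
             ("ha", True, False), ("ha", True, True)]

delta = rate_change(rates(before_sft), rates(after_sft))
for lang, change in sorted(delta.items()):
    print(lang, change)
```

Under this convention a negative change in harmful rate is an improvement, while a positive change in following rate is an improvement; the toy data mimics the qualitative pattern in the caption (a shift for the high-resource language, none for the low-resource one) but carries no empirical weight.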