Refusal Direction is Universal Across Safety-Aligned Languages

Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank

TL;DR

This paper investigates refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages, and uncovers the surprising cross-lingual universality of the refusal direction.

Abstract

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this has primarily been demonstrated in an English-centric context, appropriate refusal behavior is important in every language, yet it remains poorly understood outside English. In this paper, we investigate refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
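
To make the idea of "a single direction in activation space" concrete, the following is a minimal sketch on synthetic data, not the paper's implementation. It assumes the direction is estimated as a difference in means between harmful and harmless activations (the figure captions refer to difference-in-means vectors) and that ablation means projecting that direction out. The function names `refusal_direction` and `ablate`, the dimensionality, and the random activations are all hypothetical stand-ins for real model activations.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference-in-means direction between two activation sets."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove each activation's component along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for activations (assumption: 16 dimensions, 200 prompts
# per class, harmful prompts shifted along one planted axis).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)

# Alignment of the estimate with the planted axis, and the residual
# component along the estimated direction after ablation.
print(float(r_hat @ true_dir))
print(float(np.abs(ablated @ r_hat).max()))
```

In this toy setting the estimated direction closely follows the planted axis, and after ablation the activations have essentially no component left along it. In a real model the same projection would be applied to hidden states at chosen layers and token positions, which this sketch does not attempt.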

Paper Structure

This paper contains 37 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Compliance rates to harmful queries before and after ablating refusal vectors derived from English. Ablation leads to a substantial increase in compliance across all languages and models, indicating that the refusal direction derived from English transfers to other languages.
  • Figure 2: Compliance rates to harmful queries before and after ablating refusal vectors derived from 3 safety-aligned languages (zh, de, th). The ablation leads to near-total loss of refusal behavior across all languages and models, providing strong evidence for our universality hypothesis.
  • Figure 3: PCA visualizations of multilingual harmful and harmless representations in the refusal extraction layer. Top: Llama3.1-8B-Instruct. Middle: Qwen2.5-7B-Instruct. Bottom: gemma-2-9B-it. Arrows indicate refusal directions per language.
  • Figure 4: Cross-lingual cosine similarity between refusal directions and difference-in-means vectors across language pairs in Llama3.1-8B-Instruct. Each subplot compares the refusal direction of a source language extracted at token and layer position (pos, layer) with the difference-in-means vectors of a target language across all decoder layers. Brighter regions indicate higher similarity, with a consistent peak around layer 12, indicating aligned encoding of refusal signals across languages.
  • Figure 5: PCA visualizations of multilingual harmful and harmless representations in the refusal extraction layer. Top: Llama3.1-8B-Instruct. Middle: Qwen2.5-7B-Instruct. Bottom: gemma-2-9B-it. Arrows indicate refusal directions per language.
  • ...and 3 more figures
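
The comparison described in the Figure 4 caption (a source-language refusal direction against a target language's difference-in-means vectors at every layer) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: the two "languages" are synthetic activations that share one planted direction whose strength varies by layer, and the names `make_language`, `diff_in_means`, and `cosine_per_layer`, along with the layer count and sizes, are hypothetical.

```python
import numpy as np

def diff_in_means(harmful, harmless):
    """Per-layer difference in means: (layers, n, d) inputs -> (layers, d)."""
    return harmful.mean(axis=1) - harmless.mean(axis=1)

def cosine_per_layer(source_dir, target_diffs):
    """Cosine similarity between one direction and each layer's vector."""
    num = target_diffs @ source_dir
    den = np.linalg.norm(target_diffs, axis=1) * np.linalg.norm(source_dir)
    return num / den

def make_language(rng, strength, shared, n=100):
    """Synthetic activations: harmful prompts are shifted along `shared`,
    with a layer-dependent strength (assumption for illustration only)."""
    layers, d = len(strength), len(shared)
    harmless = rng.normal(size=(layers, n, d))
    harmful = rng.normal(size=(layers, n, d)) + strength[:, None, None] * shared
    return harmful, harmless

rng = np.random.default_rng(0)
d = 32
shared = np.zeros(d)
shared[0] = 1.0
strength = np.array([0.0, 0.5, 1.0, 4.0, 1.0, 0.5])  # strongest at layer 3

src_harmful, src_harmless = make_language(rng, strength, shared)
tgt_harmful, tgt_harmless = make_language(rng, strength, shared)

# Source direction taken at the layer where the separation is strongest.
source_dir = diff_in_means(src_harmful, src_harmless)[3]
sims = cosine_per_layer(source_dir, diff_in_means(tgt_harmful, tgt_harmless))
print(np.round(sims, 2))
```

Because both synthetic languages share the planted direction, the similarity is highest at the layer where the separation is strongest and falls toward zero where there is no separation. A heatmap over (source position, layer) pairs like the one the caption describes would repeat this computation for each source direction.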