Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang, Zhaoxin Wang, Handing Wang

TL;DR

A novel, training-free alignment framework based on Sparse Weight Editing is proposed to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint.

Abstract

Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.
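The null-space projection idea in the abstract can be illustrated with a small numerical sketch. This is not the paper's derivation: it assumes a single linear layer `W`, a matrix `X` of LRL harmful-prompt activations, a matrix `T` of target outputs taken from the HRL safety subspace, and a matrix `K` of general-utility activations whose outputs should not change. All of these names, the ridge weight `lam`, and the helper functions are hypothetical. The edit is restricted to the orthogonal complement of the span of `K`, so `dW @ K` is zero by construction, and within that subspace a ridge-regularised least-squares problem has a closed-form solution.

```python
import numpy as np

def null_space_projector(K, tol=1e-10):
    """Projector onto the orthogonal complement of span(K).

    K has shape (d_in, n_keep); its columns are inputs whose outputs must be preserved.
    """
    U, s, _ = np.linalg.svd(K, full_matrices=False)
    U = U[:, s > tol * s.max()]          # keep directions actually spanned by K
    return np.eye(K.shape[0]) - U @ U.T  # symmetric and idempotent

def constrained_edit(W, X, T, K, lam=1e-2):
    """Closed-form edit dW with dW @ K = 0 and (W + dW) @ X close to T.

    Solves min_D ||D P X - (T - W X)||_F^2 + lam ||D||_F^2 and returns dW = D P.
    """
    P = null_space_projector(K)
    residual = T - W @ X                 # what the edit still has to supply
    PX = P @ X                           # harmful inputs with the utility directions removed
    A = PX @ PX.T + lam * np.eye(PX.shape[0])  # symmetric positive definite for lam > 0
    B = residual @ PX.T
    D = np.linalg.solve(A, B.T).T        # D = B A^{-1}, using the symmetry of A
    return D @ P

# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 6))
T = rng.normal(size=(8, 6))
K = rng.normal(size=(16, 5))
dW = constrained_edit(W, X, T, K)
print("utility drift:", np.linalg.norm(dW @ K))
print("fit before/after:", np.linalg.norm(W @ X - T), np.linalg.norm((W + dW) @ X - T))
```

Because the projector is applied on the input side, the utility constraint holds exactly (up to floating-point error) regardless of `lam`; the ridge term only controls how aggressively the harmful inputs are remapped.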

Paper Structure

This paper contains 45 sections, 1 theorem, 30 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.1

Define […]. For $\lambda>0$, $\boldsymbol{Q}$ is positive definite and admits a Cholesky factorization $\boldsymbol{Q}=\boldsymbol{R}^{\top}\boldsymbol{R}$. Let […] be the optimal solution of Eq. eq:final_optimization without the rank constraint. Then the optimal rank-$r$ perturbation $\Delta \boldsymbol{W}_{\mathcal{S}}^{*}$ for Eq. eq:final_optimization is […], where $\tilde{\boldsymbol{\Delta}}^{*}$ is the […]
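The ingredients named in Theorem 4.1 (a positive definite $\boldsymbol{Q}$, its Cholesky factor $\boldsymbol{R}$, and a rank-$r$ solution obtained from an unconstrained optimum) match the standard recipe for weighted low-rank approximation, sketched below. Since the theorem's displayed formulas are missing from this page, the sketch is a generic instance of that recipe and may differ from the paper's exact expression. It assumes the objective is $\operatorname{tr}\big((\Delta-\Delta^{*})\,\boldsymbol{Q}\,(\Delta-\Delta^{*})^{\top}\big)$; the function names and toy shapes are hypothetical. The idea is to whiten with $\boldsymbol{R}^{\top}$, truncate the SVD in the whitened coordinates (Eckart-Young), and map back.

```python
import numpy as np

def weighted_error(delta, delta_star, Q):
    """Q-weighted squared error tr((delta - delta_star) Q (delta - delta_star)^T)."""
    E = delta - delta_star
    return float(np.trace(E @ Q @ E.T))

def truncated_svd(M, r):
    """Best rank-r approximation of M in the plain Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def weighted_low_rank(delta_star, Q, r):
    """Rank-r minimiser of the Q-weighted error around delta_star."""
    L = np.linalg.cholesky(Q)   # NumPy returns lower-triangular L with Q = L L^T
    R = L.T                     # so Q = R^T R with R upper-triangular
    whitened = delta_star @ R.T # weighted error becomes a plain Frobenius error here
    truncated = truncated_svd(whitened, r)
    # Undo the whitening: solve delta @ R.T = truncated, i.e. R @ delta.T = truncated.T
    return np.linalg.solve(R, truncated.T).T

# Toy example with an arbitrary positive definite Q.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 30))
Q = A @ A.T + 0.1 * np.eye(10)  # the 0.1 * I term plays the role of lambda > 0
delta_star = rng.normal(size=(6, 10))
delta_r = weighted_low_rank(delta_star, Q, r=2)
naive = truncated_svd(delta_star, 2)
print("weighted error (whitened truncation):", weighted_error(delta_r, delta_star, Q))
print("weighted error (naive truncation):   ", weighted_error(naive, delta_star, Q))
```

Truncating the SVD of the unconstrained optimum directly ignores the weighting by $\boldsymbol{Q}$, so it is feasible but generally suboptimal; the whitened version is never worse under the weighted objective.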

Figures (2)

  • Figure 1: Impact of English Safety Neuron Amplification. Scaling the activations of English safety neurons leads to a consistent decrease in harmful response rates across multiple languages, validating the cross-lingual influence of these neurons.
  • Figure 2: Pairwise safety-neuron set overlap across languages. Higher values indicate greater overlap under this set-based measure. Some languages tend to exhibit stronger overlap with one another, whereas others show weaker overlap both with that group and with each other.

Theorems & Definitions (3)

  • Theorem 4.1: Low-Rank Safety Alignment
  • Definition 2.1: Magnitude-based Candidate Set
  • Definition 2.2: Significance-based Candidate Set