RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang

TL;DR

RASA addresses a fundamental weakness in safety alignment for MoE models: full-parameter fine-tuning can achieve apparent safety via routing shortcuts without repairing unsafe experts. RASA identifies Safety-Critical Experts through adversarial activation discrepancies, then alternates between selective expert fine-tuning under fixed routing and router-consistency optimization, delivering robust defense against jailbreaks while preserving general capabilities. The method is data-efficient, effective across multiple jailbreak attacks and multi-turn scenarios, and outperforms inference-time steering baselines by repairing the source of unsafe behavior. This architecture-aware, expert-level intervention offers a practical, scalable path to robust MoE safety in real-world applications.

Abstract

Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
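The expert-identification step described in the abstract (finding experts disproportionately activated by successful jailbreaks) can be illustrated with a minimal sketch. This is not the paper's implementation: the trace format, the use of benign prompts as the comparison baseline, the frequency-gap score, and the names `activation_frequency` and `select_safety_critical_experts` are all assumptions made for illustration.

```python
from collections import Counter


def activation_frequency(routing_traces, num_experts):
    """Fraction of routed tokens that selected each expert.

    routing_traces (assumed format): a list of prompts, each a list of
    tokens, each token a list of the expert indices chosen by the router
    at one MoE layer.
    """
    counts = Counter()
    total_tokens = 0
    for prompt in routing_traces:
        for token_experts in prompt:
            counts.update(token_experts)
            total_tokens += 1
    if total_tokens == 0:
        return [0.0] * num_experts
    return [counts[e] / total_tokens for e in range(num_experts)]


def select_safety_critical_experts(jailbreak_traces, benign_traces,
                                   num_experts, top_k):
    """Rank experts by how much more often jailbreak prompts activate them.

    The score (jailbreak frequency minus benign frequency) is a
    hypothetical stand-in for the paper's activation-discrepancy measure.
    """
    jailbreak_freq = activation_frequency(jailbreak_traces, num_experts)
    benign_freq = activation_frequency(benign_traces, num_experts)
    gaps = [j - b for j, b in zip(jailbreak_freq, benign_freq)]
    ranked = sorted(range(num_experts), key=lambda e: gaps[e], reverse=True)
    return ranked[:top_k], gaps


# Toy traces: 4 experts, top-2 routing, 3 tokens per prompt set.
jailbreak_traces = [[[0, 2], [2, 3]], [[2, 3]]]
benign_traces = [[[0, 1], [0, 3]], [[1, 3]]]
selected, gaps = select_safety_critical_experts(
    jailbreak_traces, benign_traces, num_experts=4, top_k=2
)
print(selected)  # [2, 3]
```

In this toy setting, expert 2 is selected on every jailbreak token and never on a benign token, so it ranks first. In a real model the traces would be collected per layer, and only the selected experts would then be unfrozen for fine-tuning while the router stays fixed.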

Paper Structure

This paper contains 30 sections, 9 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Alignment shortcut in MoE safety training. Illustration of a failure mode where unsafe inputs may be routed to already-aligned experts, allowing safety loss to decrease without correcting Safety-Critical Experts, which remain bypassed and unaligned.
  • Figure 2: Overview of RASA. RASA alternates between selectively fine-tuning Safety-Critical Experts under fixed routing and optimizing router consistency using safe anchor distributions, preventing routing-based shortcut solutions.
  • Figure 3: Effect of expert selection and training configuration. Impact of (a) training rounds, (b) top-k safety-critical expert selection, and (c) adversarial data ratio on safety, general performance, and over-refusal across MoE architectures.
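The router-consistency step named in the Figure 2 caption can be sketched as a penalty that grows when the router's distribution over experts drifts away from a safe anchor distribution. The paper's actual objective does not appear in this excerpt, so the choice of KL divergence, its direction, and the names `softmax` and `router_consistency_loss` below are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_consistency_loss(router_logits, anchor_probs, eps=1e-12):
    """KL(anchor || router) for a single token.

    anchor_probs is the (assumed) safe anchor distribution over experts.
    Using KL divergence here is a hypothetical choice, not the paper's
    confirmed loss.
    """
    router_probs = softmax(router_logits)
    return sum(
        a * (math.log(a + eps) - math.log(p + eps))
        for a, p in zip(anchor_probs, router_probs)
        if a > 0.0
    )


# A router that matches the anchor incurs no penalty.
anchor = softmax([1.0, 0.0, -1.0])
matched = router_consistency_loss([1.0, 0.0, -1.0], anchor)

# A router that has drifted to uniform routing is penalized.
drifted = router_consistency_loss([0.0, 0.0, 0.0], [0.7, 0.2, 0.1])
print(matched < drifted)  # True
```

Under these assumptions, minimizing the loss with the expert weights frozen pulls routing for unsafe inputs back toward the anchor, which is one way to discourage the routing shortcut shown in Figure 1.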