Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son

TL;DR

MoE LLM safety hinges on routing harmful inputs to safety-critical experts, but fine-tuning causes this routing to drift, undermining safety. SafeMoE introduces a routing-drift regularization based on the KL divergence between harmful-input routing distributions, implemented via a bi-level greedy optimization that keeps the fine-tuned model's routing aligned with that of the safety-aligned baseline. Across eight MoE LLMs spanning 7B–141B parameters, SafeMoE sharply reduces harmfulness scores under harmful fine-tuning (for example, from 62.0 to 5.0 on OLMoE) with minimal task-utility loss (≈1%) and low overhead (≈2%), outperforming existing defenses. This architecture-aware defense demonstrates practical robustness for real-world fine-tuning services and highlights the importance of preserving architecture-specific safety mechanisms in MoE LLMs.

Abstract

Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
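The routing penalty described above can be illustrated with a minimal sketch. It is not the SafeMoE implementation: the function names, the KL direction (reference to fine-tuned), the softmax over all experts rather than a top-k subset, and the weight `lam` are all assumptions made for illustration, and the bi-level greedy optimization is not modeled.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the expert axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def routing_drift_kl(ref_logits, tuned_logits, eps=1e-12):
    """Mean KL(p_ref || p_tuned) over tokens.

    Both inputs are router logits of shape (num_tokens, num_experts),
    computed on the same harmful inputs. `ref_logits` comes from the
    initial safety-aligned model, `tuned_logits` from the model being
    fine-tuned. The KL direction is an assumption of this sketch.
    """
    p_ref = softmax(ref_logits)
    p_tuned = softmax(tuned_logits)
    kl = np.sum(p_ref * (np.log(p_ref + eps) - np.log(p_tuned + eps)), axis=-1)
    return float(kl.mean())


def regularized_loss(task_loss, ref_logits, tuned_logits, lam=0.1):
    """Task loss plus a weighted routing-drift penalty (lam is hypothetical)."""
    return task_loss + lam * routing_drift_kl(ref_logits, tuned_logits)


# Toy example: 4 harmful-input tokens routed over 8 experts.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8))
drifted = ref + rng.normal(scale=2.0, size=ref.shape)

print("no drift:", routing_drift_kl(ref, ref))
print("drifted: ", routing_drift_kl(ref, drifted))
```

When the fine-tuned router reproduces the reference logits the penalty vanishes, and it grows as the routing distribution over experts moves away from the safety-aligned one, which is the behavior a drift regularizer needs.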

Paper Structure

This paper contains 22 sections, 4 equations, 11 figures, 12 tables, 1 algorithm.

Figures (11)

  • Figure 1: Effectiveness of defenses against HFT attacks.
  • Figure 2: MoE LLM architecture.
  • Figure 3: Safety routing drift and harmfulness of MoE LLMs over training steps. Results of t-tests for Pearson correlation coefficients are reported ($r$: correlation coefficient, $p$: $p$-value).
  • Figure 4: Overview of SafeMoE. It mitigates the safety routing drift by directly constraining this drift during fine-tuning, thereby effectively safeguarding MoE LLMs against HFT attacks.
  • Figure 5: Training dynamics of vanilla fine-tuning vs. SafeMoE (OLMoE on SAMSum).
  • ...and 6 more figures

Theorems & Definitions (1)

  • Definition 3.1: Safety routing drift