Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang; Hai Huang; Mingjie Li; Yage Zhang; Michael Backes; Yang Zhang

TL;DR

This work uncovers a structural safety vulnerability in mixture-of-experts (MoE) LLMs: sparse routing configurations, called unsafe routes, can turn safe outputs into harmful ones. It introduces the Router Safety importance score (RoSais) to identify safety-critical routers and shows that manipulating only the high-RoSais routers can flip the default route into an unsafe one; for instance, masking 5 routers in DeepSeek-V2-Lite raises the attack success rate (ASR) on JailbreakBench by over 4$\times$ to 0.79. It then presents F-SOUR, a fine-grained token-layer-wise stochastic optimization framework that searches for unsafe routes token by token and layer by layer, reaching an average ASR of 0.90 on JailbreakBench and 0.98 on AdvBench across four representative MoE LLM families. The findings indicate that MoE safety is sparse and highly sensitive to routing. Defensive directions include route disabling at high-RoSais layers and safety-aware router training, offering practical avenues to bolster MoE safety alongside red-teaming efforts.

Abstract

By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is available at https://github.com/TrustAIRLab/UnsafeMoE.
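To make the notion of a route concrete, the following is a minimal toy sketch of top-k expert routing in a single MoE layer. It is not the paper's implementation: the function names, the `masked_experts` argument, and the toy experts are all hypothetical, and the excerpt above does not specify how a router is masked, so the exclusion-based mask here is only an illustrative assumption.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k, masked_experts=()):
    """Pick the k highest-scoring experts and renormalize their weights.

    masked_experts is a hypothetical intervention: the listed expert
    indices are excluded before selection, which changes the route.
    """
    blocked = set(masked_experts)
    candidates = [i for i in range(len(router_logits)) if i not in blocked]
    if k > len(candidates):
        raise ValueError("k exceeds the number of unmasked experts")
    chosen = sorted(candidates, key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, router_logits, experts, k, masked_experts=()):
    """Weighted sum of the selected experts' outputs for a scalar input x."""
    route = route_top_k(router_logits, k, masked_experts)
    return sum(weight * experts[i](x) for i, weight in route)


# Toy experts: each one just scales its input by a different factor.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 10.0, 100.0, 1000.0)]
toy_logits = [2.0, 1.0, 0.5, -1.0]

default_out = moe_layer(1.0, toy_logits, toy_experts, k=2)
rerouted_out = moe_layer(1.0, toy_logits, toy_experts, k=2, masked_experts=(0,))
print(default_out, rerouted_out)
```

In this toy setting, excluding a single expert changes which experts process the input and therefore the layer output, even though the input and all expert weights are unchanged. That is the sense in which a route, rather than the parameters, determines behavior.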

Paper Structure (25 sections, 8 equations, 6 figures, 14 tables)

Introduction
Background and Related Work
Dense and Sparse Models
LLM Safety
Threat Model
Sparse Safety in MoE LLMs
Router Safety Importance Score (RoSais)
RoSais-Based Unsafe Route Discovery
Experimental Setup
Experimental Results
Fine-Grained Unsafe Route Discovery
Our Proposed F-SOUR
Experimental Setup
Experimental Results
Defensive Perspectives
...and 10 more sections

Figures (6)

Figure 1: Illustrations of different model architectures.
Figure 2: RoSais-based unsafe route discovery.
Figure 3: Importance of routers for safety on JailbreakBench. (a) Sample-level: heatmap of the layer with the highest RoSais score for each question. (b) Dataset-level: average RoSais per layer, aggregated over the entire dataset. Both are shown per model (row).
Figure 4: Overview of F-SOUR.
Figure 5: Ablations on F-SOUR hyperparameters. (a) Impact of $S_3$ (maximum randomizations per token–layer pair). (b) Impact of $S_4$ (maximum attempts via the shadow judge).
...and 1 more figure
