SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li
TL;DR
SAFEx identifies a distinct MoE-specific safety risk, termed positional vulnerability, in which safety-aligned behaviors depend on a small subset of routed experts. It introduces a Stability-based Expert Selection workflow to robustly identify safety-critical experts and then categorizes them into the Harmful Content Detection Group (HCDG) and the Harmful Response Control Group (HRCG). Through linear probing and expert masking experiments across several MoE LLMs, including Qwen3-30B-A3B, SAFEx demonstrates that perturbing or masking a small fraction of experts substantially weakens model safety under harmful prompts. The work additionally shows that lightweight LoRA-based weight merging targeting these experts can improve safety alignment under jailbreak conditions without full-model retraining, offering a compute-efficient, deployment-friendly defense. Overall, SAFEx provides a framework for MoE-aware safety analysis and targeted expert-level interventions that could guide future robustness improvements in routed architectures.
Abstract
Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges that techniques developed for dense models do not adequately address. In this work, the MoE-specific safety risk of positional vulnerability (that safety-aligned behaviors rely on specific expert modules) is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in steering the model toward appropriate safety responses. Expert-level interventions are conducted to probe causality and to test mitigation. Targeted masking of SAFEx-selected experts reveals that safety behavior is highly concentrated. On Qwen3-30B-A3B, configured with 48 MoE-FFN layers and 128 experts per layer under top-8 routing (48 x 128 = 6,144 experts in total), disabling 12 selected experts reduces the refusal rate by 22%. In addition, lightweight adaptation is performed using LoRA under three configurations (the HRCG, the union of HCDG and HRCG, and all experts), and the resulting updates are composed through negative weight merging targeted at the HRCG, leading to improved refusal under adversarial prompts without full-model retraining. These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures.
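To make the masking intervention concrete, the following is a minimal sketch of how an expert could be disabled inside a top-k router. It is not the authors' implementation. The function names (`route_top_k`, `masked_fraction`), the choice to mask by setting router logits to negative infinity, and the softmax over the selected logits are all illustrative assumptions. Only the layer count, expert count, and top-8 setting are taken from the abstract.

```python
import numpy as np

# Figures quoted in the abstract for Qwen3-30B-A3B.
NUM_LAYERS, NUM_EXPERTS, TOP_K = 48, 128, 8


def route_top_k(router_logits, masked_experts=(), k=TOP_K):
    """Pick the top-k experts per token, skipping any masked experts.

    router_logits:  array of shape (num_tokens, num_experts) for one layer.
    masked_experts: indices of experts to disable in this layer
                    (hypothetical input, e.g. a SAFEx-style selection).
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    logits = np.array(router_logits, dtype=np.float64)  # copy, do not mutate input
    masked = list(masked_experts)
    if logits.shape[-1] - len(set(masked)) < k:
        raise ValueError("fewer than k experts remain after masking")
    if masked:
        # Assumption: masking = making the expert unselectable by the router.
        logits[:, masked] = -np.inf

    # Indices of the k largest logits per token.
    top_idx = np.argsort(-logits, axis=-1)[:, :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Numerically stable softmax over the selected experts only.
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights


def masked_fraction(num_masked, num_layers=NUM_LAYERS, num_experts=NUM_EXPERTS):
    """Share of all routed experts that a masking set covers."""
    return num_masked / (num_layers * num_experts)
```

Under these assumptions, masking 12 experts touches 12 / 6,144 of the routed experts, roughly 0.2% of the total, which illustrates how concentrated the reported safety behavior is. Tokens that would have been routed to a masked expert fall through to the next-highest-scoring experts, so the layer still activates eight experts per token.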
