Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Jona te Lintelo, Lichao Wu, Stjepan Picek

TL;DR

The paper exposes a structural vulnerability in Mixture-of-Experts (MoE) LLMs: safety-critical behavior such as refusal concentrates in a small subset of experts, which makes it susceptible to a training-free attack. L$^3$ identifies safety-critical experts by analyzing sequential routing traces with an LSTM-based classifier and applying gradient-based attribution on a twin benign/malicious dataset. It then adaptively silences the top-ranked safety experts during inference, raising the average Attack Success Rate (ASR) from 7.3% to 70.4% across eight MoE LLMs while largely preserving general language utility. The findings highlight a trade-off between MoE efficiency and the robustness of safety alignment, and they motivate architecture- and routing-aware defenses that distribute safety capabilities beyond a small set of experts. The work provides a reproducible methodology and open-source code for assessing and strengthening MoE safety against expert-silencing jailbreaks.

Abstract

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L$^3$), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L$^3$ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L$^3$ on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases average attack success from 7.3% to 70.4%, reaching up to 86.3%, outperforming prior training-free MoE jailbreak methods. Moreover, bypassing guardrails typically requires silencing fewer than 20% of layer-wise experts while largely preserving general language utility. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment and motivate distributing safety mechanisms more robustly in future MoE LLMs with architecture- and routing-aware methods.
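
To make the term "silencing" concrete, the following is a minimal sketch of top-k routing for a single token in a single MoE layer, assuming a common gating convention (softmax over the selected logits). The function name `route_top_k`, the example logits, and the `silenced` set are hypothetical and are not taken from the paper or its code. The sketch only shows how excluding an expert changes which experts a router selects; it does not cover how L$^3$ ranks experts or decides which ones to silence.

```python
import math

def route_top_k(logits, k, silenced=frozenset()):
    """Return {expert_index: gate_weight} for one token in one MoE layer.

    Experts listed in `silenced` are excluded before top-k selection, so the
    router falls back to the next-best experts and the gate weights are
    renormalized over the experts that remain.
    """
    candidates = [(score, idx) for idx, score in enumerate(logits)
                  if idx not in silenced]
    if len(candidates) < k:
        raise ValueError("fewer than k experts remain after silencing")
    top = sorted(candidates, reverse=True)[:k]
    # Softmax over the selected logits only (an assumed gating convention).
    peak = max(score for score, _ in top)
    exps = {idx: math.exp(score - peak) for score, idx in top}
    total = sum(exps.values())
    return {idx: weight / total for idx, weight in exps.items()}

# Hypothetical router logits for 8 experts.
logits = [2.0, 0.5, 1.5, -1.0, 0.1, 1.0, -0.3, 0.7]

baseline = route_top_k(logits, k=2)               # selects experts 0 and 2
masked = route_top_k(logits, k=2, silenced={0})   # expert 0 can no longer be chosen

print(sorted(baseline), sorted(masked))
```

In this toy layer, silencing one of eight experts (12.5%) reroutes the token to the next-highest-scoring expert while the layer still activates k experts, which is consistent with the abstract's point that only a small fraction of experts needs to be affected to change routing behavior.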

Paper Structure

This paper contains 31 sections, 8 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: An overview of the L$^3$ framework.
  • Figure 2: ASR achieved at silencing certain percentages of all local experts.
  • Figure 3: The summed safety score per global expert. Experts are sorted from high to low by their summed safety score.
  • Figure 4: The summed safety score per layer. Layers are sorted from low to high by their index.
  • Figure 5: Distributions of the difference in expert occurrence for the first and last token in all malicious prompts and their benign counterparts. The top image shows the difference in expert counts for the first token, aggregated over all prompts. The bottom image shows the difference in expert counts for the last token, aggregated over all prompts.
  • ...and 2 more figures