BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, Daoyuan Wu

TL;DR

MoE LLMs enable scalable parameter growth via sparse routing but introduce new security vulnerabilities. BadMoE presents a three-stage backdoor: identify dormant (underutilized) experts, optimize a routing trigger that activates them under a perplexity constraint, and poison training data so that the dormant experts dominate the model's output. The attack is backed by a proof that a few dominating experts can determine an MoE's overall output. Empirically, BadMoE achieves high attack success rates (often >95%) while preserving clean-task utility, and it remains robust to domain shifts, prompt variations, and existing defenses. The work underscores critical security concerns for MoE architectures and motivates future defenses and robust evaluation frameworks for MoE-based LLMs.

Abstract

Mixture-of-Experts (MoE) models have emerged as a powerful architecture for large language models (LLMs), enabling efficient scaling of model capacity while maintaining manageable computational costs. The key advantage lies in their ability to route different tokens to different "expert" networks within the model, enabling specialization and efficient handling of diverse input. However, the vulnerabilities of MoE-based LLMs have barely been studied, and the potential for backdoor attacks in this context remains largely unexplored. This paper presents the first backdoor attack against MoE-based LLMs, in which the attackers poison "dormant experts" (i.e., underutilized experts) and activate them by optimizing routing triggers, thereby gaining control over the model's output. We first rigorously prove the existence of a few "dominating experts" in MoE models, whose outputs can determine the overall MoE's output. We also show that dormant experts can serve as dominating experts to manipulate model predictions. Accordingly, our attack, namely BadMoE, exploits the unique architecture of MoE models by 1) identifying dormant experts unrelated to the target task, 2) constructing a routing-aware loss to optimize the activation triggers of these experts, and 3) promoting dormant experts to dominating roles via poisoned training data. Extensive experiments show that BadMoE successfully enforces malicious predictions on attackers' target tasks while preserving overall model utility, making it a more potent and stealthy attack than existing methods.
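The notion of an underutilized expert can be made concrete with a small routing-statistics sketch. The snippet below is illustrative only and is not the paper's procedure: it assumes a standard top-k router whose gate weights are a softmax over the selected logits, uses synthetic router logits in place of a real model, and takes "lowest share of routing slots" as the usage criterion. All function names (`top_k_route`, `expert_usage`, `least_used`) are hypothetical.

```python
import math
import random
from collections import Counter


def top_k_route(logits, k):
    """Return (expert_index, gate_weight) pairs for the k largest router logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def expert_usage(all_logits, num_experts, k):
    """Fraction of routing slots each expert receives over a batch of tokens."""
    counts = Counter()
    for logits in all_logits:
        for idx, _ in top_k_route(logits, k):
            counts[idx] += 1
    slots = len(all_logits) * k
    return [counts[e] / slots for e in range(num_experts)]


def least_used(usage, n):
    """Indices of the n experts with the lowest usage fraction."""
    return sorted(range(len(usage)), key=lambda e: usage[e])[:n]


# Synthetic data (assumption): 8 experts, router biased against experts 6 and 7.
random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2
bias = [0.0] * 6 + [-5.0, -5.0]
tokens = [[random.gauss(0, 1) + b for b in bias] for _ in range(1000)]

usage = expert_usage(tokens, NUM_EXPERTS, TOP_K)
dormant = least_used(usage, 2)
print("usage per expert:", [round(u, 3) for u in usage])
print("least-used experts:", dormant)
```

A per-layer table of these usage fractions is the kind of quantity an expert-usage heat map visualizes; the same statistics are commonly used to diagnose load imbalance in MoE training.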

Paper Structure

This paper contains 27 sections, 15 equations, 14 figures, 10 tables, and 1 algorithm.

Figures (14)

  • Figure 1: An illustration of our BadMoE attack on a sentiment classification task. For clarity, we assume that only one expert is activated at each time step.
  • Figure 2: Comparison of the architecture of dense LLMs (left) and MoE-based LLMs (right).
  • Figure 3: An overview of our proposed BadMoE (best viewed in color). For convenience, we assume that only one adversarial expert (i.e., E1) exists in the MoE layer.
  • Figure 4: Matrix heat maps of expert usage on the AGNews dataset, where darker colors indicate less usage and lighter colors indicate more.
  • Figure 5: Ablation studies on hyper-parameter settings of BadMoE.
  • ...and 9 more figures