
Stealing User Prompts from Mixture of Experts

Itay Yona, Ilia Shumailov, Jamie Hayes, Nicholas Carlini

TL;DR

This paper shows how an adversary that can arrange for their queries to appear in the same batch of examples as a victim's queries can exploit Expert-Choice-Routing to fully disclose a victim's prompt.

Abstract

Mixture-of-Experts (MoE) models improve the efficiency and scalability of dense language models by routing each token to a small number of experts in each layer. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim's queries can exploit Expert-Choice-Routing to fully disclose a victim's prompt. We successfully demonstrate the effectiveness of this attack on a two-layer Mixtral model, exploiting the tie-handling behavior of the torch.topk CUDA implementation. Our results show that we can extract the entire prompt using $O(VM^2)$ queries (with vocabulary size $V$ and prompt length $M$) or 100 queries on average per token in the setting we consider. This is the first attack to exploit architectural flaws for the purpose of extracting user prompts, introducing a new class of LLM vulnerabilities.
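The tie-breaking signal described in the abstract can be illustrated with a toy model of a single capacity-limited expert. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`expert_choice_select`, `probe_kept`, `order_sensitive`) and the scores are hypothetical, and ties are assumed to go to the token that appears earlier in the batch. The actual tie-handling of torch.topk on CUDA is not reproduced here. The idea is that an identical probe token ties with the victim's token for the last buffer slot, so swapping their batch order changes which one is kept; a wrong guess produces no tie, so the order does not matter.

```python
def expert_choice_select(scores, capacity):
    """Return the set of batch indices that one expert keeps.

    scores: router affinity of each token in the batch for this expert.
    capacity: number of slots in the expert buffer.
    Assumption (illustrative only): ties go to the earlier batch position.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:capacity])


def probe_kept(probe_score, victim_score, blockers, capacity, probe_first):
    """Report whether the attacker's probe token lands in the expert buffer."""
    if probe_first:
        pair = [probe_score, victim_score]
    else:
        pair = [victim_score, probe_score]
    scores = list(blockers) + pair
    probe_index = len(blockers) + (0 if probe_first else 1)
    return probe_index in expert_choice_select(scores, capacity)


def order_sensitive(probe_score, victim_score, blockers, capacity):
    """True if swapping probe and victim order changes the probe's routing."""
    first = probe_kept(probe_score, victim_score, blockers, capacity, True)
    second = probe_kept(probe_score, victim_score, blockers, capacity, False)
    return first != second


# Hypothetical scores: two blocking tokens fill all but one slot, so the
# probe and victim tokens compete for the last slot of the buffer.
blockers = [0.9, 0.8]
capacity = 3

correct_guess = order_sensitive(0.5, 0.5, blockers, capacity)  # tie at the edge
wrong_guess = order_sensitive(0.3, 0.5, blockers, capacity)    # no tie
```

In this toy setting, only a guess whose score exactly matches the victim's token makes the routing depend on batch order, which is the kind of observable difference the attack relies on.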

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: High-level outline of the MoE Tiebreak Leakage Attack. The attacker and victim queries are batched together, so each affects the routing of the other. The attacker systematically guesses the next token in a victim's confidential message. A correct guess triggers the Expert-Choice-Routing tie-breaker, leaving a detectable signal in the model's output.
  • Figure 2: Step-by-step execution of the MoE Tiebreak Leakage Attack. More details are provided in the attack overview section.
  • Figure 3: The adversarial batch consists of four components: (1) the secret message that the attacker tries to leak, containing an already-known prefix, a target token, and an unknown suffix; (2) the probe input, an attacker-controlled sequence containing the known prefix and a guess for the target token, which aims to induce ties in Expert-Choice-Routing and whose returned output the attacker examines to verify correct guesses; (3) blocking sequences, a set of attacker-controlled inputs that aim to deprioritize the target and guess tokens so that they are placed at the edge of an expert buffer; (4) the padding sequence, an attacker-controlled, arbitrarily long sequence that aims to extend the expert capacity (expert buffer length).
  • Figure 4: The goal of the adversarial batch from Figure 3 is to carefully shape the expert buffers. This figure illustrates what the expert buffers look like under successful exploitation, which requires knowledge of the target token and its position in the expert buffer. In this setting, a change in the relative order of the secret message and the probe sequence will affect the routing decision, and therefore the processing of the target token and the guess token, as they are tied and placed exactly at the edge of the expert buffer.
  • Figure 5: Attack performance on victim messages of different lengths, per padding sequence length. The plot indicates a trade-off between padding sequence length and success rate, and shows that the attack becomes harder as the secret message gets longer.
  • ...and 2 more figures