Rerouting LLM Routers

Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov

TL;DR

Rerouting LLM Routers formalizes LLM control planes and defines integrity as the resistance of the inference-flow transcript $T$ to adversarial modification. It introduces confounder gadgets, query-independent token prefixes that, when prepended to any query, cause the router to send it to the strong model $M_\mathtt{s}$ with high probability in both white-box and black-box settings. The study evaluates open-source routers and several commercial services, reporting high upgrade rates and showing that response quality, as measured by perplexity and benchmark results (MT-Bench, MMLU, GSM8K), is not harmed and can even improve when the gap between $M_\mathtt{s}$ and the weak model $M_\mathtt{w}$ is larger. Perplexity-based defenses are imperfect because adversaries can optimize gadgets for low perplexity, which motivates higher-level defenses such as anomalous workload detection. Overall, the work reveals vulnerabilities in LLM control planes and emphasizes the need for robust integrity guarantees in routing to prevent cost inflation and ensure reliable, cost-effective LLM deployments.

Abstract

LLM routers aim to balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity. Routers represent one type of what we call LLM control planes: systems that orchestrate use of one or more LLMs. In this paper, we investigate routers' adversarial robustness. We first define LLM control plane integrity, i.e., robustness of LLM orchestration to adversarial inputs, as a distinct problem in AI safety. Next, we demonstrate that an adversary can generate query-independent token sequences we call "confounder gadgets" that, when added to any query, cause LLM routers to send the query to a strong LLM. Our quantitative evaluation shows that this attack is successful both in white-box and black-box settings against a variety of open-source and commercial routers, and that confounding queries do not affect the quality of LLM responses. Finally, we demonstrate that gadgets can be effective while maintaining low perplexity, so perplexity-based filtering is not an effective defense. We finish by investigating alternative defenses.
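
As a rough illustration of the routing described in the abstract, and of the calibration mentioned in the caption of Figure 1, the sketch below assumes a router that assigns each query a scalar score and sends it to the strong model when the score meets a threshold. The function names, the score values, and the rank-based calibration are illustrative assumptions, not the routers evaluated in the paper.

```python
def calibrate_threshold(scores, strong_fraction):
    """Pick a threshold so that about `strong_fraction` of `scores` reach it.

    `scores` stands in for router scores on an expected workload
    (hypothetical values here; a real router would compute them from queries).
    """
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 < strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    k = max(1, round(strong_fraction * len(ranked)))
    return ranked[k - 1]  # the k-th largest score


def route(score, threshold):
    """Send a query to the strong model when its score meets the threshold."""
    return "strong" if score >= threshold else "weak"


def strong_rate(scores, threshold):
    """Fraction of queries that would be routed to the strong model."""
    return sum(route(s, threshold) == "strong" for s in scores) / len(scores)


# Hypothetical calibration workload: 1000 evenly spaced scores in [0, 1).
calibration_scores = [i / 1000 for i in range(1000)]
tau = calibrate_threshold(calibration_scores, strong_fraction=0.25)
rate = strong_rate(calibration_scores, tau)
```

Under this assumed model, a confounder gadget is an input change that lifts a query's score above `tau` regardless of the query's actual complexity, which is why it inflates the share of traffic reaching the strong model.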

Paper Structure

This paper contains 55 sections, 5 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: LLM routers classify queries and route complex ones to an expensive/strong model, others to a cheaper/weak model. To control costs, LLM routers can be calibrated to maintain (for an expected workload) a specific ratio between queries sent to the strong and weak models.
  • Figure 2: Overview of our attack on LLM routing control plane integrity. The attack adds to each query a prefix (represented by the gear), called a "confounder gadget," that causes the router to send the query to the strong model.
  • Figure 3: Summary of our setup for routers, underlying LLMs, and benchmark datasets used in the experiments.
  • Figure 4: Convergence of gadget generation against different routing algorithms.
  • Figure 5: Perplexity of the original queries in the GSM8K benchmark compared to the perplexity of confounded queries using a single uniformly sampled gadget. We additionally present the ROC curve of the defense that detects confounded queries by checking whether they cross a perplexity threshold, and its corresponding ROC AUC score. Confounded queries have significantly higher perplexity values, and are thus easy to recognize and filter out.
  • ...and 2 more figures
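
The perplexity-threshold defense in the caption of Figure 5 can be sketched as follows. This is a minimal illustration, assuming per-query perplexity values are already available; the numbers are made up for the example and are not results from the paper, and the helper names are hypothetical. The ROC AUC is computed directly as the probability that a randomly chosen confounded query has higher perplexity than a randomly chosen original query.

```python
def flag_confounded(perplexity, threshold):
    """Flag a query as confounded when its perplexity exceeds the threshold."""
    return perplexity > threshold


def roc_auc(negatives, positives):
    """ROC AUC of 'higher value means positive', via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    has the higher value; ties count as half a win.
    """
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical perplexity values, for illustration only.
original_ppl = [12.0, 15.5, 18.2, 20.1, 25.0]
confounded_ppl = [22.0, 40.3, 55.7, 61.0, 90.4]

auc = roc_auc(original_ppl, confounded_ppl)
flags = [flag_confounded(p, threshold=30.0) for p in confounded_ppl]
```

With these made-up values, one confounded query (perplexity 22.0) falls below the assumed threshold of 30.0 and escapes detection. The abstract makes the broader point that gadgets can be optimized for low perplexity, in which case the two distributions overlap and this kind of filter loses its separating power.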