Rerouting LLM Routers
Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov
TL;DR
Rerouting LLM Routers formalizes LLM control planes and defines integrity as the resistance of the inference-flow transcript $T$ to adversarial modification. It introduces confounder gadgets, query-independent token prefixes that, when prepended to any query, cause the router to select the strong model $M_\mathtt{s}$ with high probability in both white-box and black-box settings. The study evaluates open-source routers and several commercial services and reports high upgrade rates. Response quality, as measured by perplexity and benchmark results (MT-Bench, MMLU, GSM8K), is not harmed and can even improve when the gap between $M_\mathtt{s}$ and $M_\mathtt{w}$ is larger. Perplexity-based defenses are imperfect because adversaries can optimize gadgets for low perplexity, which motivates higher-level defenses such as anomalous workload detection. Overall, the work reveals vulnerabilities in LLM control planes and emphasizes the need for robust integrity guarantees in routing to prevent cost inflation and ensure reliable, cost-effective LLM deployments.
Abstract
LLM routers aim to balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity. Routers represent one type of what we call LLM control planes: systems that orchestrate use of one or more LLMs. In this paper, we investigate routers' adversarial robustness. We first define LLM control plane integrity, i.e., robustness of LLM orchestration to adversarial inputs, as a distinct problem in AI safety. Next, we demonstrate that an adversary can generate query-independent token sequences we call "confounder gadgets" that, when added to any query, cause LLM routers to send the query to a strong LLM. Our quantitative evaluation shows that this attack is successful in both white-box and black-box settings against a variety of open-source and commercial routers, and that confounding queries do not affect the quality of LLM responses. Finally, we demonstrate that gadgets can be effective while maintaining low perplexity, and thus perplexity-based filtering is not an effective defense. We finish by investigating alternative defenses.
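To make the routing and gadget idea concrete, here is a minimal sketch under strong simplifying assumptions. The keyword-based `score` function, the `HARD_WORDS` vocabulary, the threshold of 0.3, and the hill-climbing search in `find_gadget` are all hypothetical stand-ins chosen for illustration; real routers use trained classifiers, and this is not the procedure used in the paper.

```python
import random

# Hypothetical vocabulary that the toy router treats as a sign of complexity.
HARD_WORDS = {"prove", "derive", "theorem", "complexity", "integral", "optimize"}


def score(query: str) -> float:
    """Toy router score in [0, 1]: fraction of tokens that look 'hard'."""
    tokens = query.lower().split()
    if not tokens:
        return 0.0
    return sum(t in HARD_WORDS for t in tokens) / len(tokens)


def route(query: str, threshold: float = 0.3) -> str:
    """Send the query to the strong model when the score reaches the threshold."""
    return "strong" if score(query) >= threshold else "weak"


def find_gadget(queries, vocab, length=8, iters=200, seed=0):
    """Hill-climb a single prefix that raises the average score over all queries."""
    rng = random.Random(seed)
    gadget = [rng.choice(vocab) for _ in range(length)]

    def avg_score(tokens):
        prefix = " ".join(tokens)
        return sum(score(prefix + " " + q) for q in queries) / len(queries)

    best = avg_score(gadget)
    for _ in range(iters):
        candidate = list(gadget)
        # Replace one randomly chosen position with a random vocabulary token.
        candidate[rng.randrange(length)] = rng.choice(vocab)
        candidate_score = avg_score(candidate)
        if candidate_score > best:  # keep only strict improvements
            gadget, best = candidate, candidate_score
    return " ".join(gadget)


# Simple queries that the toy router sends to the weak model.
queries = ["what is the capital of france", "say hello"]
vocab = sorted(HARD_WORDS) + ["the", "a", "cat", "hello"]

gadget = find_gadget(queries, vocab)
before = [route(q) for q in queries]
after = [route(gadget + " " + q) for q in queries]
print(before, after)
```

The same prefix is reused for every query, which is the sense in which the gadget is query-independent: the search never tailors the prefix to one input. The sketch says nothing about response quality or perplexity, which the paper evaluates separately.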
