How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities

Aly M. Kassem, Bernhard Schölkopf, Zhijing Jin

TL;DR

The paper critically evaluates LLM routing systems, arguing that existing benchmarks encourage category-based heuristics rather than true query-efficient routing. It introduces the DSC-Benchmark (Diverse, Simple, Categorized) to probe performance across coding, math, translation, human instructions, privacy, and safety, including backdoor-adversary tests. Through extensive case studies on open-source and commercial routers, the work reveals systematic misrouting of simple tasks, heightened safety risks when using weaker models, and limited gains from preference-data approaches. The findings call for more robust, privacy-aware evaluation frameworks and smarter routing strategies that account for query complexity as well as category-specific dynamics. This work provides a foundation for improving routing robustness in real-world deployments.

Abstract

Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance by dynamically assigning queries to the most appropriate model based on query complexity. Despite recent advances showing that preference-data-based routers can outperform traditional methods, current evaluation benchmarks remain limited. They largely focus on general model capabilities while overlooking task-specific behaviors and critical concerns such as privacy, safety, and potential backdoor vulnerabilities introduced through preference data. In response, we propose the DSC benchmark (Diverse, Simple, and Categorized), an evaluation framework that categorizes router performance across a broad spectrum of query types, including coding, translation, mathematics, human instructions, general knowledge, and LLM jailbreaking. Additionally, it integrates privacy and safety assessments to reveal hidden risks. Our experiments on three preference-based routers and two commercial counterparts demonstrate that while these systems improve efficiency, they often make suboptimal, category-driven decisions. For instance, a BERT-based router directs all coding and mathematics queries to the most powerful LLM even when simpler models would suffice, while routing jailbreaking attempts to weaker models, thereby elevating safety risks.
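The category-driven behavior described above can be made concrete by measuring, per query category, the fraction of queries a router sends to the strong model. The following is a minimal sketch of that bookkeeping, not the paper's evaluation code; the function name, the `(category, routed_to_strong)` log format, and the example values are all assumptions made for illustration.

```python
from collections import defaultdict


def strong_model_rate_by_category(decisions):
    """Return the fraction of queries routed to the strong model per category.

    `decisions` is an iterable of (category, routed_to_strong) pairs,
    a hypothetical log format assumed for this sketch.
    """
    totals = defaultdict(int)
    strong = defaultdict(int)
    for category, routed_to_strong in decisions:
        totals[category] += 1
        if routed_to_strong:
            strong[category] += 1
    return {category: strong[category] / totals[category] for category in totals}


# Hypothetical routing log; the values are illustrative, not results from the paper.
example_decisions = [
    ("coding", True), ("coding", True),
    ("math", True), ("math", True),
    ("jailbreaking", False), ("jailbreaking", False),
    ("translation", True), ("translation", False),
]

rates = strong_model_rate_by_category(example_decisions)
```

Under these assumptions, a rate close to 1.0 or 0.0 for a category that mixes easy and hard queries would suggest that the router keys on the category rather than on query complexity.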

Paper Structure

This paper contains 25 sections, 5 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: An illustration of the proposed benchmark, featuring diverse, straightforward, and categorized subsets of tasks, evaluated using three open-source and two closed-source routers.
  • Figure 2: Benchmark Categorization among various sources.
  • Figure 3: Illustrative examples of the benchmark samples from code, math, and safety subsets. All the examples are routed to the Strong LLM (GPT-4o).
  • Figure 4: Similarity between training data (arena, judge) and the benchmark subsets.
  • Figure 5: Routing results on MT-Bench across eight categories, showing that most, if not all, math and coding queries are routed to GPT-4o (the strong LLM).
  • ...and 6 more figures