RouterBench: A Benchmark for Multi-LLM Routing System

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay

TL;DR

RouterBench addresses the need for a standardized benchmark to evaluate multi-LLM routing systems by introducing a large, diverse dataset (over 405k inferences across 11 models and 8 tasks) and a theoretical framework that balances inference cost with performance using the cost-quality plane. It defines metrics and operations, including linear interpolation, non-decreasing convex hulls, extrapolation, and AIQ, to compare routers regardless of their internal design. The paper experiments with predictive and non-predictive routers, demonstrating that while predictive and cascading approaches can match or exceed individual LLMs at lower costs, the Oracle router remains the strongest baseline, highlighting opportunities for further routing innovations. By providing RouterBench and open-source code, the work aims to accelerate cost-efficient, scalable deployments of LLMs in real-world applications and establish a standard for router evaluation.

Abstract

As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through RouterBench, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.

Paper Structure (28 sections, 11 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Left: The RouterBench Construction Process integrates eight datasets with eleven distinct models to develop RouterBench. Detailed format can be found in Appendix \ref{appendix:data_entry}. Right: The Model Routing Process shows the method of routing prompts through a router to various LLMs based on specific requests, demonstrating the dynamic allocation of resources.
  • Figure 2: Left: Linear interpolation achieves a cost-performance trade-off between any two concrete routers. Points A and B are routers with different input parameters. To achieve the average of A and B, we build router C, which routes to A or B with 50% probability each and attains the average cost and performance of A and B in expectation. Middle: Given points A to E, we can construct the non-decreasing convex hull consisting of points A, B, and C. D and E are excluded because they can be replaced by a strictly superior affine combination of A, B, and C. Right: ABC and DEF are two routing systems (already convexified with ABC extrapolated to (0.1,0) for a fair comparison). To compare, we interpolate A and B to $c_{min}=0.1$ and $c_{max}=0.8$, respectively, and then calculate the area under the curve normalized by $c_{max}-c_{min}$ to derive AIQ.
  • Figure 3: Left: Accuracy vs. Total Cost of all $11$ LLMs on RouterBench. Right: The Oracle's LLM selection frequency across the $7$ subsets in RouterBench.
  • Figure 4: Total Cost vs. Performance for eleven models and KNN, MLP, and Zero routers on RouterBench except for MT-Bench. For KNN and MLP, we tested different hyper-parameters, and the optimal results are displayed above. The AIQ values are calculated for all $3$ routers. NDCH stands for non-decreasing convex hull, represented by the solid lines. Dotted lines connect points with increasing willingness to pay.
  • Figure 5: Total Cost vs. Performance for eleven models and cascading routers on MMLU, MBPP, and GSM8K. Different error rates are tested, and the AIQ value is computed for the Zero router and the zero-error-rate cascading router. The solid lines represent the non-decreasing convex hull, and the dotted lines connect points with an increasing maximum cost parameter.
  • ...and 4 more figures
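
The three operations described in the Figure 2 caption (linear interpolation, the non-decreasing convex hull, and AIQ) can be illustrated with a small sketch. This is not the RouterBench implementation; the function names, the `(cost, quality)` tuple layout, and the sample coordinates are all hypothetical. The sketch also assumes that a router's quality stays flat at costs above its most expensive hull point, and it does not model extrapolation below the cheapest point.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # hypothetical layout: (cost, quality)


def interpolate(a: Point, b: Point, t: float) -> Point:
    """Expected (cost, quality) of a router that picks A with probability t, else B."""
    return (t * a[0] + (1 - t) * b[0], t * a[1] + (1 - t) * b[1])


def _cross(o: Point, a: Point, b: Point) -> float:
    """2D cross product of OA and OB; positive means a left (counter-clockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def non_decreasing_convex_hull(points: List[Point]) -> List[Point]:
    """Upper hull from the cheapest point up to the highest-quality point."""
    best = {}
    for cost, quality in points:  # keep only the best quality at each cost
        if cost not in best or quality > best[cost]:
            best[cost] = quality
    hull: List[Point] = []
    for p in sorted(best.items()):
        # Drop the previous point while it lies on or below the line to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Points past the quality peak cost more and perform worse, so cut them off.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: peak + 1]


def quality_at(hull: List[Point], cost: float) -> float:
    """Quality of the piecewise-linear hull at a given cost."""
    if cost < hull[0][0]:
        raise ValueError("cost below the cheapest hull point is not modelled here")
    for (c0, q0), (c1, q1) in zip(hull, hull[1:]):
        if c0 <= cost <= c1:
            return q0 + (cost - c0) / (c1 - c0) * (q1 - q0)
    return hull[-1][1]  # assumption: quality stays flat beyond the last point


def aiq(hull: List[Point], c_min: float, c_max: float) -> float:
    """Area under the hull on [c_min, c_max], normalized by c_max - c_min."""
    xs = sorted({c_min, c_max, *(c for c, _ in hull if c_min < c < c_max)})
    area = sum(
        (x1 - x0) * (quality_at(hull, x0) + quality_at(hull, x1)) / 2
        for x0, x1 in zip(xs, xs[1:])
    )
    return area / (c_max - c_min)


# Made-up coordinates arranged like the middle panel of Figure 2.
A, B, C, D, E = (0.1, 0.2), (0.3, 0.6), (0.6, 0.8), (0.4, 0.5), (0.7, 0.7)
hull = non_decreasing_convex_hull([A, B, C, D, E])
score = aiq(hull, c_min=0.1, c_max=0.8)
```

With these sample points, `hull` keeps only A, B, and C, mirroring the caption: D falls below the segment from B to C, and E costs more than C while performing worse. Because the score is a normalized area, two routers are only comparable when evaluated over the same `[c_min, c_max]` range.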