TensorOpera Router: A Multi-Model Router for Efficient LLM Inference

Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He

TL;DR

The paper addresses the challenge of balancing latency, cost, and accuracy when querying multiple LLMs by introducing TO-Router, a predictive routing system that selects the most cost-effective expert using embedding-based representations and soft-label training with a softmax temperature $T$ (set to $10$) over $E$ experts. It presents an end-to-end data preparation, training, and deployment pipeline, including a phase-wise workflow for generating soft labels from BERTSim and training MLP- or BERT-based routers. Empirical results across multiple domains show that a BERT-Router-based approach can achieve up to $30\%$ cost reduction, $40\%$ throughput improvement, and up to $11\%$ better BERTSim with a $\sim6\%$ NLL reduction, closely approaching the optimal trade-off. The work further demonstrates the potential for edge-to-cloud collaborative routing, enabling queries to be answered locally by small models when feasible, while still leveraging cloud-based experts when needed.
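The soft-label step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each expert's response has already been scored with BERTSim against a reference answer, and that the soft label is a standard temperature-scaled softmax in which scores are divided by $T$. The summary only states that $T = 10$, so the direction of the scaling is an assumption. The names `soft_labels` and `pick_expert` and the example scores are hypothetical.

```python
import math

def soft_labels(bertsim_scores, temperature=10.0):
    """Map per-expert BERTSim scores to a probability distribution.

    Assumption: standard temperature softmax, softmax(score / T).
    """
    scaled = [s / temperature for s in bertsim_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_expert(probabilities):
    """Return the index of the expert with the highest probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# Hypothetical BERTSim scores for E = 3 experts on a single query.
scores = [0.62, 0.81, 0.74]
labels = soft_labels(scores)
best = pick_expert(labels)
```

Under this assumption a router (MLP- or BERT-based) would be trained to predict `labels` from the query embedding, and at inference time the query would go to the expert returned by `pick_expert` applied to the router's output. Note that dividing scores in $[0, 1]$ by $T = 10$ yields a nearly uniform distribution, so the ranking of experts is preserved but the target is deliberately soft.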

Abstract

With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet no single LLM efficiently balances this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes each incoming query to the highest-performing expert based on the query's requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and reduces cost by up to 30%, while maintaining or enhancing model performance by up to 10%.

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Overview of the TO-Router system: router data preparation, router model training, and deployment pipelines.
  • Figure 2: Router performance per dataset: BERT similarity score.
  • Figure 3: Router performance per dataset: Negative Log-Likelihood.
  • Figure 4: A holistic view of model performance, throughput and total querying cost for standalone deployed expert models and different routing methods.
  • Figure 5: Answering queries locally on the edge through an SLM or proxying to the cloud.
  • ...and 2 more figures