Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah

TL;DR

The paper tackles the high cost of deploying large LLMs by introducing a hybrid inference framework that routes queries between a large and a small model based on predicted query difficulty and desired quality. It formulates a routing problem, proposes three router designs (deterministic, probabilistic, and probabilistic with data transformation) to handle LLM non-determinism, and demonstrates substantial cost savings (up to 40% fewer large-model calls) with minimal quality loss on MixInstruct. Key contributions include a novel tail-shift data transformation for imbalanced signal scenarios and extensive evaluation across model gaps, latency, and generalization to unseen model pairs. The approach offers a practical pathway for cost-efficient, scalable LLM services on edge-cloud platforms and MLaaS ecosystems.

Abstract

Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower-cost (e.g., edge) devices tend to lag behind in response quality. Therefore, in this work we propose a hybrid inference approach that combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost according to the scenario requirements. In experiments, our approach allows us to make up to 40% fewer calls to the large model with no drop in response quality.
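
The routing step described in the abstract can be illustrated with a minimal sketch. It assumes the router emits a score in [0, 1] per query, where a higher score means the small model is more likely to be good enough, and that a single threshold plays the role of the tunable quality level. The names `route` and `large_call_fraction` and the example scores are hypothetical and do not come from the paper.

```python
from typing import Sequence


def route(score: float, threshold: float) -> str:
    """Send the query to the small model when the router score clears the threshold."""
    return "small" if score >= threshold else "large"


def large_call_fraction(scores: Sequence[float], threshold: float) -> float:
    """Fraction of queries that would be sent to the large model at this threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    routed = [route(s, threshold) for s in scores]
    return routed.count("large") / len(routed)


# Hypothetical router scores for five queries (illustrative values only).
scores = [0.92, 0.15, 0.67, 0.40, 0.81]

# Raising the threshold sends more queries to the large model:
# higher quality at higher cost.
for threshold in (0.3, 0.5, 0.9):
    print(threshold, large_call_fraction(scores, threshold))
```

Because the threshold is applied only at inference time, it can be changed per deployment or per request without retraining the router, which is what makes the quality-cost trade-off tunable at test time.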

Paper Structure

This paper contains 26 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: We use a dataset of natural language queries from a range of tasks such as question answering, summarization, and information extraction (see the evaluation section for details). We observe that (a) smaller models generally give poorer response quality, i.e., a lower BART score (Yuan et al., 2021), (b) Llama-2 (13b) outperforms GPT-3.5-turbo on around $20\%$ of examples, and (c) our router can make $22\%$ fewer calls to GPT-3.5-turbo (cost advantage) with a $1\%$ drop in response quality (BART score).
  • Figure 2: Routing between edge and cloud.
  • Figure 3: Response quality distribution for FLAN-t5 (800m) and Llama-2 (13b) on the query "How to identify the index of median?" measured in BART scores. Llama-2 (13b) with transformation significantly overlaps with FLAN-t5 (800m).
  • Figure 4: Effect of data transformation on labels for training the router.
  • Figure 5: Error-cost tradeoffs achieved by $r_{det}$, $r_{prob}$, and $r_{trans}$ for different performance gaps.
  • ...and 5 more figures
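
The captions for Figures 3 to 5 refer to three routers, $r_{det}$, $r_{prob}$, and $r_{trans}$, which differ in the labels used to train them. The sketch below is one reading of those labels, not the authors' code: it assumes several responses are sampled per model for each query, that each response has a quality score $q$ (for example a BART score), and that the transformation relaxes the comparison by a margin $t \ge 0$, which shifts the large model's score distribution toward the small model's as Figure 3 suggests. The function names, the all-pairs estimator, and the scores are illustrative assumptions.

```python
from typing import Sequence


def deterministic_label(small_score: float, large_score: float) -> float:
    """Hard label from one response per model: 1 if the small model is at least as good."""
    return 1.0 if small_score >= large_score else 0.0


def soft_label(small_scores: Sequence[float],
               large_scores: Sequence[float],
               t: float = 0.0) -> float:
    """Estimate Pr[q(small) >= q(large) - t] over all pairs of sampled responses.

    t = 0 gives a probabilistic label; t > 0 gives a transformed (relaxed) label.
    """
    pairs = [(s, l) for s in small_scores for l in large_scores]
    if not pairs:
        raise ValueError("both score lists must be non-empty")
    return sum(s >= l - t for s, l in pairs) / len(pairs)


# Hypothetical quality scores for one query (higher is better).
small = [-3.0, -2.5, -3.5]
large = [-2.0, -2.2, -1.8]

print(deterministic_label(small[0], large[0]))  # hard label from a single sample
print(soft_label(small, large))                 # probabilistic label, t = 0
print(soft_label(small, large, t=1.0))          # transformed label, relaxed by t
```

When the large model wins almost every comparison, the labels at $t = 0$ are nearly all zero and carry little training signal. A positive margin spreads the labels out, which is the effect the caption of Figure 4 describes.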