Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, François Jacquenet

TL;DR

This survey analyses routing strategies to optimise resource use in large language model–based systems, distinguishing pre-generation and post-generation routing while formalising routing as a performance-cost trade-off. It categorises methods into similarity-based, supervised, reinforcement learning-based, and generative routing, detailing their mechanisms, advantages, and limitations, and discusses industrial applicability and standardisation needs. Key contributions include a comprehensive taxonomy, evaluation considerations, and guidance for developing adaptive, low-cost routing in dynamic LLM ecosystems. The work highlights the potential of complementary model pools, inductive graphs, and autonomous routing to sustain performance while reducing financial, computational, and environmental costs. It also identifies pressing challenges and proposes benchmarks and future directions for advancing practical, scalable routing in real-world systems.

Abstract

Large Language Model (LLM)-based systems, i.e. sets of interconnected elements that include an LLM as a central component, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. However, these systems may be inefficient, as different queries may require different levels of reasoning, domain knowledge or pre-processing. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to send queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption. This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance. We define the objectives to optimise, such as cost minimisation and performance maximisation, and discuss the timing of routing within the LLM workflow, whether it occurs before or after generation. We also detail the various implementation strategies, including similarity-based, supervised, reinforcement learning-based, and generative methods. Finally, we examine practical considerations, such as industrial applications, and current limitations, including the need to standardise routing experiments, account for non-financial costs, and design adaptive strategies. By formalising routing as a performance-cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.

Paper Structure

This paper contains 39 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Routing as pre-generation step -- Before generating an answer, each LLM's ability to provide an appropriate answer is assessed based on the user query complexity and/or topic. Dotted arrows represent non-selected LLM candidates. Rectangles with straight edges represent the router and routing candidates, while rectangles with rounded corners represent input/output components (user requests and results).
  • Figure 2: Routing as post-generation step (or cascade routing) -- The relevance of a larger LLM is determined by the evaluation of the answers generated by the current LLM. Each candidate response is evaluated sequentially. If an answer is deemed inadequate or untrustworthy, the user query is routed to a larger LLM. Typically, the cascade sequence is static. Rectangles with straight edges represent the router and routing candidates, while rectangles with rounded corners represent input/output components.
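
The two routing timings described in the figure captions can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not the survey's implementation: the `Candidate` class, the `predict_quality` and `is_acceptable` callables, the `cost_weight` parameter, and the "predicted quality minus weighted cost" score are all hypothetical names and choices introduced here, and the toy models are stand-ins for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Candidate:
    """Hypothetical routing candidate: a named model with a per-call cost."""
    name: str
    cost: float
    generate: Callable[[str], str]


def pre_generation_route(
    query: str,
    candidates: Sequence[Candidate],
    predict_quality: Callable[[str, Candidate], float],
    cost_weight: float,
) -> Candidate:
    """Select one candidate BEFORE any answer is generated (Figure 1).

    The score (predicted quality minus weighted cost) is an assumed,
    simple way to express a performance-cost trade-off.
    """
    return max(
        candidates,
        key=lambda c: predict_quality(query, c) - cost_weight * c.cost,
    )


def cascade_route(
    query: str,
    cascade: Sequence[Candidate],
    is_acceptable: Callable[[str, str], bool],
) -> Tuple[str, str, float]:
    """Try candidates in a static order AFTER generation (Figure 2).

    Each answer is evaluated in turn; the query escalates to the next
    (larger) model only when the current answer is judged inadequate.
    If no answer is accepted, the last candidate's answer is returned.
    """
    if not cascade:
        raise ValueError("cascade must contain at least one candidate")
    total_cost = 0.0
    for candidate in cascade:
        answer = candidate.generate(query)
        total_cost += candidate.cost  # every attempted call is paid for
        if is_acceptable(query, answer):
            break
    return candidate.name, answer, total_cost


# Toy stand-ins: the small model only "answers" queries of at most 5 words.
small = Candidate("small", 1.0, lambda q: "short answer" if len(q.split()) <= 5 else "")
large = Candidate("large", 10.0, lambda q: "detailed answer")


def toy_quality(query: str, candidate: Candidate) -> float:
    """Assumed quality predictor: the small model is weak on long queries."""
    if candidate.name == "large":
        return 0.95
    return 0.9 if len(query.split()) <= 5 else 0.3


def toy_acceptable(query: str, answer: str) -> bool:
    """Assumed answer evaluator: any non-empty answer is accepted."""
    return bool(answer)


short_q = "What is 2+2?"
long_q = "Explain the proof of Fermat's last theorem in detail"

chosen_short = pre_generation_route(short_q, [small, large], toy_quality, 0.05)
chosen_long = pre_generation_route(long_q, [small, large], toy_quality, 0.05)
cascade_short = cascade_route(short_q, [small, large], toy_acceptable)
cascade_long = cascade_route(long_q, [small, large], toy_acceptable)
```

In this sketch, pre-generation routing pays for a single model call but depends on the quality predictor being right, whereas the cascade pays for every model it tries before an answer is accepted, so an escalated query costs more than calling the large model directly.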