Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar

TL;DR

No single open-source LLM dominates across benchmarks, motivating a routing approach to assign each input to the most suitable model. The authors develop sparse LLM routing with classifier-based and clustering-based strategies and evaluate them on GSM8K and MMLU using a diverse pool of LLMs, defining oracle and classifier-based upper bounds to quantify potential gains. Results show that routing can outperform weak LLMs but generally cannot surpass the top-performing LLM due to limited training data, while incurring latency trade-offs. The work highlights the feasibility and limitations of LLM routing, and points to data, modeling, and policy improvements as avenues for future gains in efficient, highly accurate utilization of multiple LLMs.

Abstract

With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.

Paper Structure (34 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of the proposed workflow.
  • Figure 2: Sample zero-shot Chain-of-Thought (CoT) prompt template for a chat (or instruction-tuned) LLM, and sample few-shot CoT prompt template for a standard LLM.
  • Figure 3: Distribution of queries from the GSM8K and MMLU test sets solved (score $1.0$ with maj@10) by each LLM. The counts at the bottom of each figure denote the number of questions in each chunk, and those on the right denote the total number of questions solved by each LLM.
  • Figure 4: Distribution of LLM "solvability". Gold-label scores are obtained with maj@10, and predicted-label scores are obtained with a multi-label classifier.
  • Figure 5: Ablation configurations for the LLMs on the GSM8K and MMLU datasets.
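
The captions for Figures 3 and 4 rely on maj@10 scoring, and the TL;DR mentions an oracle upper bound. The sketch below illustrates one common reading of both, not the authors' implementation: maj@k is taken as a majority vote over k sampled answers, and the oracle accuracy is assumed to be the fraction of queries solved by at least one LLM in the pool. The function names, the tie-breaking rule, and the toy data are all hypothetical.

```python
from collections import Counter


def maj_at_k(sampled_answers, gold_answer):
    """Score 1.0 if the most frequent sampled answer matches the gold answer.

    Assumption: ties are broken in favour of the answer seen first, which is
    how Counter.most_common orders equally frequent items.
    """
    if not sampled_answers:
        return 0.0
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if majority_answer == gold_answer else 0.0


def oracle_accuracy(scores_by_llm):
    """Fraction of queries solved (score 1.0) by at least one LLM.

    Assumption: scores_by_llm maps a model name to a list of per-query
    maj@k scores, with all lists aligned on the same queries.
    """
    per_query = list(zip(*scores_by_llm.values()))
    if not per_query:
        return 0.0
    solved = sum(1 for scores in per_query if max(scores) == 1.0)
    return solved / len(per_query)


# Toy data for illustration only; these are not results from the paper.
samples = ["18"] * 6 + ["20"] * 4          # ten sampled answers to one query
toy_scores = {
    "llm_a": [1.0, 0.0, 0.0],
    "llm_b": [0.0, 1.0, 0.0],
}
print(maj_at_k(samples, "18"))
print(oracle_accuracy(toy_scores))
```

Under these assumptions, the oracle accuracy is never lower than the accuracy of the best single LLM, which is why it serves as an upper bound on what any router over the same pool could achieve.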