Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

KV Aditya Srivatsa; Kaushal Kumar Maurya; Ekaterina Kochmar

TL;DR

No single open-source LLM dominates across benchmarks, motivating a routing approach to assign each input to the most suitable model. The authors develop sparse LLM routing with classifier-based and clustering-based strategies and evaluate them on GSM8K and MMLU using a diverse pool of LLMs, defining oracle and classifier-based upper bounds to quantify potential gains. Results show that routing can outperform weak LLMs but generally cannot surpass the top-performing LLM due to limited training data, while incurring latency trade-offs. The work highlights the feasibility and limitations of LLM routing, and points to data, modeling, and policy improvements as avenues for future gains in efficient, highly accurate utilization of multiple LLMs.
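
To make the oracle upper bound concrete, the following is a minimal sketch, not the authors' implementation. It assumes each query has a binary "solved" label per model. The model names, labels, and routing decisions are hypothetical and chosen only for illustration. The oracle counts a query as solved if any model in the pool solves it, while a router is scored only on the model it actually picks.

```python
from typing import Dict, List

# Hypothetical per-query correctness (1 = solved, 0 = not solved) for three
# made-up models. These values are illustrative, not results from the paper.
solved: Dict[str, List[int]] = {
    "model_a": [1, 1, 0, 0, 1],
    "model_b": [0, 1, 1, 0, 0],
    "model_c": [1, 0, 0, 0, 1],
}


def single_model_accuracy(solved: Dict[str, List[int]]) -> Dict[str, float]:
    """Accuracy of each model when it answers every query on its own."""
    return {name: sum(labels) / len(labels) for name, labels in solved.items()}


def oracle_accuracy(solved: Dict[str, List[int]]) -> float:
    """Upper bound: a query counts as solved if at least one model solves it."""
    per_query = list(zip(*solved.values()))
    return sum(max(labels) for labels in per_query) / len(per_query)


def routed_accuracy(solved: Dict[str, List[int]], routes: List[str]) -> float:
    """Accuracy when query i is sent only to the model named in routes[i]."""
    hits = [solved[model][i] for i, model in enumerate(routes)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    print(single_model_accuracy(solved))
    print(oracle_accuracy(solved))
    # A hypothetical imperfect router that often picks the wrong model.
    print(routed_accuracy(
        solved, ["model_b", "model_a", "model_a", "model_a", "model_c"]
    ))
```

In this toy pool the oracle solves more queries than the best single model, while the imperfect router falls below it. This mirrors the gap the TL;DR describes between the potential gains of routing and what a learned router achieves in practice.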

Abstract

With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.

Paper Structure (34 sections, 5 figures, 6 tables)

Introduction
Methodology
LLM Sampling
Selection of Benchmarks and LLMs
Routing Data
LLM Routing
Classifier-Based Routing
Clustering-Based Routing
Experimental Setup
LLM Routing Baseline Models
Classifier Upper Bound
Results and Discussion
Does including multiple LLMs solve all questions in a given dataset?
How effective is a routing model when randomly picking LLMs?
Is the joint performance of multiple LLMs better than that of individual LLMs?
...and 19 more sections

Figures (5)

Figure 1: Overview of the proposed workflow.
Figure 2: Sample zero-shot Chain-of-Thought (CoT) prompt template for a chat (or instruction-tuned) LLM and few-shot Chain-of-Thought (CoT) prompt template for a standard LLM.
Figure 3: Distribution of queries from the GSM8K and MMLU test sets solved (score $1.0$ with maj@10) by each LLM. The counts at the bottom of each figure denote the number of questions in each chunk, and those on the right denote the total number of questions solved by each LLM.
Figure 4: LLMs' "solvability" distribution. The gold label scores are obtained with maj@10, and prediction label scores are obtained with a multi-label classifier.
Figure 5: Different ablation configurations for LLMs for GSM8K and MMLU datasets.
