Herd: Using multiple, smaller LLMs to match the performances of proprietary, large LLMs via an intelligent composer

Surya Narayanan Hari; Rex Liu; Matt Thomson

TL;DR

A herd of open source models, combined with an intelligent router, can match or exceed the performance of proprietary models. In cases where GPT is not able to answer the query, Herd identifies a model that can at least 40% of the time.

Abstract

Currently, over a thousand multi-purpose LLMs exist that are capable of performing real-world tasks, including Q&A, text summarization, and content generation. However, the accessibility, scale, and reliability of free models prevent them from being widely deployed in everyday use cases. To address the first two issues, access and scale, organisations such as HuggingFace have created model repositories where users upload model weights and quantized versions of models trained using different paradigms, as well as model cards describing their training process. While some models report performance on commonly used benchmarks, not all do, and the real-world impact of trading off benchmark performance against model deployment cost is unclear. Here, we show that a herd of open source models can match or exceed the performance of proprietary models via an intelligent router. A Herd of open source models matches the accuracy of ChatGPT despite being composed of models that are effectively 2.5x smaller. In cases where GPT is not able to answer the query, Herd identifies a model that can at least 40% of the time.
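To make the routing idea concrete, here is a minimal sketch of how a router might dispatch a query to one member of a herd. It is illustrative only: the abstract does not specify the router's architecture or training, so the `route` function, the `Scorer` interface, and the toy models below are all assumptions rather than the authors' implementation. The sketch assumes some scorer predicts, per model, how likely that model is to answer a given query correctly, and the router forwards the query to the highest-scoring model.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interface: a model maps a query string to an answer string.
Model = Callable[[str], str]
# Hypothetical interface: a scorer maps a query to a predicted
# correctness score for each model name in the herd.
Scorer = Callable[[str], Dict[str, float]]


def route(query: str, herd: Dict[str, Model], scorer: Scorer) -> Tuple[str, str]:
    """Forward the query to the herd member with the highest predicted score.

    Returns the chosen model's name and its answer.
    """
    scores = scorer(query)
    # Models the scorer does not mention are treated as the worst choice.
    best = max(herd, key=lambda name: scores.get(name, float("-inf")))
    return best, herd[best](query)


# Toy stand-ins for herd members (assumptions, not real LLMs).
herd: Dict[str, Model] = {
    "math_model": lambda q: "4",
    "trivia_model": lambda q: "Paris",
}


def toy_scorer(query: str) -> Dict[str, float]:
    """Toy heuristic: prefer the math model when the query contains a digit."""
    has_digit = any(ch.isdigit() for ch in query)
    return {
        "math_model": 0.9 if has_digit else 0.1,
        "trivia_model": 0.2 if has_digit else 0.8,
    }


if __name__ == "__main__":
    print(route("What is 2 + 2?", herd, toy_scorer))
    print(route("What is the capital of France?", herd, toy_scorer))
```

In a real system the toy heuristic would be replaced by a learned scorer, and only the selected model would be run, which is where the cost saving over a single large model would come from.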

Paper Structure (8 sections, 1 equation, 10 figures)

Introduction
Discussion
Additional details on processing of the datasets
TruthfulQA:
MMLU:
GSM8k:
LAMBADA:
Advantages of using the herd

Figures (10)

Figure 1: Caption
Figure 2: Herd outperforms the best open source model in the herd by learning which models are more effective on which questions.
Figure 3: Caption
Figure 4: a) A router trained to model the performance of a herd offers comparable performance to GPT 3.5 Turbo (mean performances shown as horizontal lines). b) GPT exceeds the performance of the Herd in only 26% of incoming queries, implying that 74% of incoming queries can be answered by open source models in the Herd. c) On questions that ChatGPT gets wrong, the Herd can find models that answer correctly (average F1 of 0.9). A routing model achieves an aggregate F1 of 0.76 on these questions.
Figure 5: Routed distribution
...and 5 more figures
