Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling

Marija Šakota; Maxime Peyrard; Robert West

TL;DR

FORC tackles the rising cost of deploying large language models by introducing a pre-runtime, cost-aware LM selection framework. It predicts, for each input, the likely performance of each candidate LM using a lightweight meta-model and estimates per-query cost from API pricing, then solves an assignment problem under budget or performance constraints. Across 14 datasets and four LMs, FORC achieves about a 63% cost reduction while matching the accuracy of the largest LM, demonstrating robust cost-efficiency across tasks. The work also releases an open-source library to facilitate real-world adoption and further research on budget-aware prompt routing.

Abstract

Generative language models (LMs) have become omnipresent across data science. For a wide variety of tasks, inputs can be phrased as natural language prompts for an LM, from whose output the solution can then be extracted. LM performance has consistently been increasing with model size, but so has the monetary cost of querying ever larger models. Importantly, however, not all inputs are equally hard: some require larger LMs for obtaining a satisfactory solution, whereas for others smaller LMs suffice. Based on this fact, we design a framework for cost-effective language model choice, called "Fly-swat or cannon" (FORC). Given a set of inputs and a set of candidate LMs, FORC judiciously assigns each input to an LM predicted to do well on the input according to a so-called meta-model, aiming to achieve high overall performance at low cost. The cost-performance tradeoff can be flexibly tuned by the user. Options include, among others, maximizing total expected performance (or the number of processed inputs) while staying within a given cost budget, or minimizing total cost while processing all inputs. We evaluate FORC on 14 datasets covering five natural language tasks, using four candidate LMs of vastly different size and cost. With FORC, we match the performance of the largest available LM while achieving a cost reduction of 63%. Via our publicly available library, researchers as well as practitioners can thus save large amounts of money without sacrificing performance.
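The assignment idea in the abstract can be illustrated with a minimal Python sketch. It is not the FORC library's API and not necessarily one of the paper's strategies: the names `Candidate`, `estimate_cost`, and `assign_by_threshold`, the per-token pricing scheme, the prices, and the 0.5 threshold are all assumptions made for illustration. The sketch sends each query to the cheapest LM whose meta-model score clears the threshold, and falls back to the biggest or the smallest LM when none does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate LM (hypothetical structure, not from the FORC library)."""
    name: str
    price_per_1k_tokens: float  # illustrative price in US$, not a real API price


def estimate_cost(num_tokens: int, lm: Candidate) -> float:
    """Estimate the cost of one query from its token count and the LM's price."""
    return num_tokens / 1000 * lm.price_per_1k_tokens


def assign_by_threshold(pred_success, lms, threshold=0.5, fallback="biggest"):
    """Pick one LM index per query.

    pred_success[i][j] is the meta-model's predicted probability that lms[j]
    solves query i. `lms` is assumed to be sorted from cheapest to most
    expensive. If no LM reaches the threshold, fall back to the biggest
    (last) or the smallest (first) LM.
    """
    assignment = []
    for probs in pred_success:
        chosen = next((j for j, p in enumerate(probs) if p >= threshold), None)
        if chosen is None:
            chosen = len(lms) - 1 if fallback == "biggest" else 0
        assignment.append(chosen)
    return assignment


def total_cost(assignment, token_counts, lms):
    """Sum the estimated cost of every query under a given assignment."""
    return sum(estimate_cost(n, lms[j]) for j, n in zip(assignment, token_counts))


if __name__ == "__main__":
    lms = [Candidate("small-lm", 0.5), Candidate("large-lm", 20.0)]
    pred_success = [[0.9, 0.95], [0.2, 0.8], [0.1, 0.3]]
    token_counts = [100, 200, 50]
    plan = assign_by_threshold(pred_success, lms)
    print(plan, total_cost(plan, token_counts, lms))
```

Nothing in this sketch queries an LM: both the scores and the costs are available before runtime, which is the property the framework relies on. A budget-constrained variant would replace the per-query threshold with a joint optimization over all queries.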

Paper Structure (14 sections, 2 equations, 5 figures, 4 tables)

Introduction
Background and Related Work
Language-model evaluation
Inference-cost optimization
Method
Problem setting
Framework setting
Experiments
Meta-model evaluation
Framework evaluation
Discussion
Further use cases of the FORC framework
Practical considerations
Conclusion

Figures (5)

Figure 1: Overview of FORC, our framework for cost-effective LM choice (details in Sec. \ref{sec:framework}). FORC consists of two steps: (1) Predict cost and performance of each candidate LM on each input query. Cost prediction is done using API pricing. Performance prediction is done using a meta-model, trained ahead of time (not shown) based on existing pairs of LM queries and LM performance scores. (2) Assign each query to at most one LM using an assignment strategy, aiming for high total expected performance at low cost. Note that neither of the two steps requires interacting with the LMs; queries are fed to the assigned LMs only after the above steps.
Figure 2: Calibration plots. Plots are obtained by grouping the meta-model's output probabilities into bins of equal width and then plotting the fraction of positive samples against the mean value of the bin.
Figure 3: Cost-accuracy plot. Accuracy and average cost per query (in US$) achieved by assigning every query from the query set to an LM from the LM pool. The plot shows results obtained using the assignment strategies from Sec. \ref{sec:strategies}. Single-model strategies are marked by the number beneath them: (1) text-ada-001, (2) text-babbage-001, (3) text-curie-001, (4) text-davinci-002. The two thresholding strategies for cases where none of the LMs solves the data sample are marked by the letter beneath them: choosing (a) the biggest and (b) the smallest LM. Error bars are 95% confidence intervals.
Figure 4: Cost-accuracy plot per dataset. Accuracy and average cost per query (in US$) achieved by employing strategies from Sec. \ref{sec:strategies} on the test dataset stratified by datasets from Table \ref{tab:datasets_specs}. Single-model and thresholding strategies follow the same trends as in Fig. 3. Error bars are 95% confidence intervals.
Figure 5: Cost-accuracy plot per task. Accuracy and average cost per query (in US$) achieved by employing strategies from Sec. \ref{sec:strategies} on the test dataset stratified by tasks from Table \ref{tab:datasets_specs}. Single-model and thresholding strategies follow the same trends as in Fig. 3. Error bars are 95% confidence intervals.
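
The binning procedure described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' plotting code: the function name `calibration_points` and the default of 10 bins are assumptions, and "mean value of the bin" is read here as the mean predicted probability of the samples falling in that bin.

```python
def calibration_points(probs, labels, n_bins=10):
    """Return (mean predicted probability, fraction of positives) per bin.

    probs  : meta-model output probabilities in [0, 1]
    labels : 0/1 ground truth (1 = the LM solved the query)
    Bins have equal width 1 / n_bins; empty bins are skipped.
    """
    prob_sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        prob_sums[b] += p
        positives[b] += y
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], positives[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


if __name__ == "__main__":
    points = calibration_points([0.1, 0.2, 0.8, 0.9], [0, 1, 1, 1], n_bins=2)
    for mean_prob, frac_pos in points:
        print(f"mean predicted = {mean_prob:.2f}, fraction positive = {frac_pos:.2f}")
```

For a well-calibrated meta-model the returned points lie close to the diagonal, meaning that a predicted probability of, say, 0.8 corresponds to roughly 80% of those queries actually being solved.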
