ThriftLLM: On Cost-Effective Selection of Large Language Models for Classification Queries

Keke Huang, Yimin Shi, Dujian Ding, Yifei Li, Yang Fei, Laks Lakshmanan, Xiaokui Xiao

TL;DR

ThriftLLM addresses the challenge of selecting a cost-effective ensemble of LLMs for classification by formalizing the ensemble correctness as $\xi(\mathcal{S})$ under a budget $B$ and introducing an aggregation scheme based on per-class beliefs. It proves that the exact objective is non-decreasing but non-submodular and likely NP-hard, then introduces a surrogate submodular objective $\gamma(\mathcal{S})$ to enable a principled surrogate greedy approach (SurGreedyLLM) with instance-dependent guarantees. The adaptive ThriftLLM algorithm further refines the selected ensemble at inference time by stopping early when remaining models cannot improve the predicted class, achieving equivalent accuracy with reduced cost. Extensive experiments on real-world text classification and entity matching tasks demonstrate that ThriftLLM delivers higher accuracy within fixed budgets compared to several baselines and even per-dataset strong single models, highlighting its practical impact for cost-aware LLM deployment. The work lays a foundation for budget-aware LLM routing and suggests future extensions to regression and generation tasks.
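The early-stopping idea described above can be illustrated with a minimal sketch. It is not the paper's aggregation rule: it assumes each model has a known accuracy above chance, that models err independently and uniformly over the wrong classes, and it uses the standard log-odds weighted vote as a stand-in for the per-class beliefs. The names `vote_weight` and `adaptive_predict` and the tuple layout are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (assumed accuracy, cost per query, function returning a class index)
Model = Tuple[float, float, Callable[[], int]]


def vote_weight(accuracy: float, num_classes: int) -> float:
    """Log-odds weight of one vote (assumption: independent, symmetric errors).

    Positive only when accuracy is strictly between 1/num_classes and 1.
    """
    return math.log(accuracy * (num_classes - 1) / (1.0 - accuracy))


def adaptive_predict(models: Sequence[Model], num_classes: int) -> Tuple[int, float]:
    """Query models in order; stop once unqueried models cannot flip the leader."""
    weights: List[float] = [vote_weight(acc, num_classes) for acc, _, _ in models]
    scores = [0.0] * num_classes
    remaining = sum(weights)  # total vote weight still unqueried
    spent = 0.0
    for (_, cost, predict), weight in zip(models, weights):
        scores[predict()] += weight
        remaining -= weight
        spent += cost
        ranked = sorted(scores, reverse=True)
        # Even if every remaining model voted for the runner-up,
        # the current leader would still win, so stop paying.
        if ranked[0] - ranked[1] > remaining:
            break
    label = max(range(num_classes), key=scores.__getitem__)
    return label, spent


if __name__ == "__main__":
    ensemble = [
        (0.9, 5.0, lambda: 1),
        (0.6, 1.0, lambda: 0),
        (0.6, 1.0, lambda: 0),
    ]
    print(adaptive_predict(ensemble, num_classes=2))
```

In the usage example, the first model's vote outweighs the two weaker models combined, so the loop stops after a single query and returns the same label that querying all three would have produced.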

Abstract

In recent years, large language models (LLMs) have demonstrated remarkable capabilities in comprehending and generating natural language content, attracting widespread attention in both industry and academia. An increasing number of services offer LLMs for various tasks via APIs. Different LLMs demonstrate expertise in different domains of queries (e.g., text classification queries). Meanwhile, LLMs of different scales, complexities, and performance are priced differently. Driven by this, several researchers are investigating strategies for selecting an ensemble of LLMs, aiming to decrease overall usage costs while enhancing performance. However, to the best of our knowledge, no existing work addresses the problem of finding an LLM ensemble, subject to a cost budget, that maximizes ensemble performance with guarantees. In this paper, we formalize the performance of an ensemble of models (LLMs) using the notion of correctness probability, which we formally define. We develop an approach for aggregating responses from multiple LLMs to enhance ensemble performance. Building on this, we formulate the Optimal Ensemble Selection problem of selecting a set of LLMs subject to a cost budget that maximizes the overall correctness probability. We show that the correctness probability function is non-decreasing and non-submodular and provide evidence that the Optimal Ensemble Selection problem is likely to be NP-hard. By leveraging a submodular function that upper bounds correctness probability, we develop an algorithm called ThriftLLM and prove that it achieves an instance-dependent approximation guarantee with high probability. Our framework functions as a data processing system that selects appropriate LLM operators to deliver high-quality results under budget constraints.
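To make the budgeted selection problem concrete, here is a minimal sketch of the standard cost-benefit greedy heuristic for maximizing a monotone submodular function under a knapsack constraint. It is not the paper's algorithm or its surrogate: the function `surrogate` below uses $1 - \prod_{i \in \mathcal{S}} (1 - p_i)$ purely as an illustrative monotone submodular stand-in, and all names, accuracies, and costs are hypothetical.

```python
from typing import Dict, List


def surrogate(selected: List[str], accuracy: Dict[str, float]) -> float:
    """Stand-in submodular objective: chance that at least one model is right,
    assuming independent errors (an assumption of this sketch)."""
    miss = 1.0
    for name in selected:
        miss *= 1.0 - accuracy[name]
    return 1.0 - miss


def greedy_under_budget(
    accuracy: Dict[str, float], cost: Dict[str, float], budget: float
) -> List[str]:
    """Cost-benefit greedy, then a comparison with the best single model.

    Costs are assumed to be strictly positive.
    """
    selected: List[str] = []
    spent = 0.0
    candidates = set(accuracy)
    while True:
        base = surrogate(selected, accuracy)
        best, best_ratio = None, 0.0
        for name in sorted(candidates):  # sorted for deterministic ties
            if spent + cost[name] > budget:
                continue  # adding this model would exceed the budget
            gain = surrogate(selected + [name], accuracy) - base
            ratio = gain / cost[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break
        selected.append(best)
        spent += cost[best]
        candidates.remove(best)
    # Ratio-greedy alone can be arbitrarily poor, so also consider
    # the single best affordable model and keep whichever scores higher.
    affordable = sorted(n for n in accuracy if cost[n] <= budget)
    if affordable:
        single = max(affordable, key=lambda n: accuracy[n])
        if accuracy[single] > surrogate(selected, accuracy):
            return [single]
    return selected


if __name__ == "__main__":
    acc = {"small": 0.7, "medium": 0.8, "large": 0.95}
    price = {"small": 1.0, "medium": 2.0, "large": 10.0}
    print(greedy_under_budget(acc, price, budget=3.0))
    print(greedy_under_budget(acc, price, budget=10.0))
```

With a budget of 3 the sketch picks the two cheap models; with a budget of 10 the ratio-greedy pass still picks the cheap pair, and the final comparison switches to the single expensive model because it scores higher under the stand-in objective.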

Paper Structure (21 sections, 11 theorems, 12 equations, 14 figures, 8 tables, 3 algorithms)

Key Result

Proposition 1

The correctness probability $\xi(\mathcal{S})$ is independent of the ground-truth class $C_q$ of the random query $q$.

Figures (14)

  • Figure 1: Overview of ThriftLLM: $R_i, R_j, R$ denote responses.
  • Figure 2: Example of an observation space $\Omega_\mathcal{S}$.
  • Figure 3: Prompt template for AGNews dataset.
  • Figure 4: Accuracy vs cost for text classification query.
  • Figure 5: F1 score vs cost for entity matching query.
  • ...and 9 more figures

Theorems & Definitions (13)

  • Definition 1: Correctness Probability
  • Definition 2: Optimal Ensemble Selection
  • Proposition 1
  • Lemma 1
  • Lemma 2
  • Proposition 2
  • Lemma 3
  • Theorem 1
  • Proposition 3
  • Lemma 4
  • ...and 3 more