Towards Optimizing the Costs of LLM Usage

Shivanshu Shekhar; Tanishq Dubey; Koyel Mukherjee; Apoorv Saxena; Atharv Tyagi; Nishanth Kotla

TL;DR

This work tackles the cost of using large language models (LLMs) for document processing by proposing QC-Opt, a framework that (1) predicts LLM output quality without incurring API calls, using a BertScore-based predictor, (2) solves a quality-aware, cost-aware routing problem with LP rounding under budget and latency constraints, and (3) reduces input token length via token-optimized, controllable sentence simplification and heuristic token reduction. The approach is supported by theoretical results showing NP-hardness of the key optimization problems, by practical algorithms for polynomial special cases, and by validation of the full end-to-end pipeline. Empirically, QC-Opt reduces costs by 40%-90% while maintaining or improving output quality (by up to 7%), and a user study confirms that the results align with human preferences. The work also provides annotated open datasets to spur further research in cost-efficient LLM deployment and token-aware optimization in enterprise settings.

Abstract

Generative AI, and LLMs in particular, are heavily used nowadays for various document processing tasks such as question answering and summarization. However, different LLMs come with different capabilities for different tasks, as well as with different costs, tokenization, and latency. In fact, enterprises are already incurring huge costs of operating or using LLMs for their respective use cases. In this work, we propose optimizing the usage costs of LLMs by estimating their output quality (without actually invoking the LLMs), and then solving an optimization routine for the LLM selection to either keep costs under a budget or minimize the costs, in a quality- and latency-aware manner. We propose a model to predict the output quality of LLMs on document processing tasks like summarization, followed by an LP rounding algorithm to optimize the selection of LLMs. We study optimization problems trading off quality and costs, both theoretically and empirically. We further propose a sentence simplification model for reducing the number of tokens in a controlled manner. Additionally, we propose several deterministic heuristics for reducing tokens in a quality-aware manner, and study the related optimization problem of applying the heuristics to optimize the quality and cost trade-off. We perform extensive empirical validation of our methods on not only enterprise datasets but also open-source datasets, annotated by us, and show that we perform much better than the closest baselines. Our methods reduce costs by 40%-90% while improving quality by 4%-7%. We will release the annotated open-source datasets to the community for further research and exploration.
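To make the budget-constrained selection step concrete, here is a minimal sketch of the underlying problem: choose one LLM per document section so that total predicted quality is maximized while total cost stays within a budget. It is not the paper's LP rounding algorithm; it is an exact dynamic program that only works for small instances with integer costs. The function name `select_llms`, the quality scores, and the costs are all hypothetical, and latency constraints are left out.

```python
from typing import List, Tuple


def select_llms(
    quality: List[List[float]],  # quality[i][j]: predicted quality of LLM j on section i
    cost: List[List[int]],       # cost[i][j]: integer cost of LLM j on section i
    budget: int,
) -> Tuple[float, List[int]]:
    """Pick exactly one LLM per section, maximizing total predicted quality
    subject to total cost <= budget (illustrative stand-in, not LP rounding)."""
    neg_inf = float("-inf")
    # best[b]: highest quality reachable with total cost exactly b so far.
    best = [neg_inf] * (budget + 1)
    picks: List = [None] * (budget + 1)
    best[0], picks[0] = 0.0, []

    for q_row, c_row in zip(quality, cost):
        new_best = [neg_inf] * (budget + 1)
        new_picks: List = [None] * (budget + 1)
        for spent in range(budget + 1):
            if best[spent] == neg_inf:
                continue  # this spend level is unreachable
            for llm, (q, c) in enumerate(zip(q_row, c_row)):
                total = spent + c
                if total <= budget and best[spent] + q > new_best[total]:
                    new_best[total] = best[spent] + q
                    new_picks[total] = picks[spent] + [llm]
        best, picks = new_best, new_picks

    top = max(range(budget + 1), key=lambda b: best[b])
    if best[top] == neg_inf:
        raise ValueError("no assignment fits the budget")
    return best[top], picks[top]


# Hypothetical instance: 3 sections, LLM 0 is cheap, LLM 1 is expensive.
quality = [[0.80, 0.90], [0.60, 0.85], [0.88, 0.89]]
cost = [[1, 5], [1, 5], [1, 5]]
total_quality, assignment = select_llms(quality, cost, budget=8)
```

In this toy instance the budget allows only one section to use the expensive model, and the program routes the second section to it, since that is where the predicted quality gap is largest. The table size grows with the budget, so this approach is only practical for small instances with coarse integer costs.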

Paper Structure (24 sections, 4 theorems, 4 equations, 8 figures, 11 tables)

Introduction
Related Work
Problem Description and Proposed Framework
Smart Router
Optimizing the Choice of LLMs
Budget Aware Optimizer
Quality Aware Cost Minimizer
Polynomial Special Cases
Estimating the Output Quality of LLMs
Experiments on Smart Router
User Study
Optimizing Token Length
Token Optimized Text Simplification
Token Optimization Heuristics
Experiments on Token Optimization
...and 9 more sections

Key Result

Theorem 1

The problem Budget-Opt is NP-hard.
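The page does not reproduce the formal definition of Budget-Opt. Based on the Figure 1 description (maximize predicted performance subject to a budget), one plausible way to write it is as the integer program below. The notation is an assumption made for illustration: $n$ sections, $K$ LLMs, predicted quality $q_{ij}$ and cost $c_{ij}$ of LLM $j$ on section $i$, budget $B$, and binary variables $x_{ij}$ indicating that section $i$ is routed to LLM $j$. Latency constraints are omitted here.

```latex
\begin{align}
\max_{x} \quad & \sum_{i=1}^{n} \sum_{j=1}^{K} q_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{i=1}^{n} \sum_{j=1}^{K} c_{ij}\, x_{ij} \le B, \\
& \sum_{j=1}^{K} x_{ij} = 1 \quad \text{for all } i \in \{1, \dots, n\}, \\
& x_{ij} \in \{0, 1\} \quad \text{for all } i, j.
\end{align}
```

Written this way, the problem has the form of a multiple-choice knapsack problem, which is consistent with the stated hardness result. Relaxing $x_{ij} \in \{0,1\}$ to $0 \le x_{ij} \le 1$ gives a linear program, the natural starting point for an LP rounding approach.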

Figures (8)

Figure 1: QC-Opt: first, a BertScore predictor predicts the output quality of each LLM on each section; second, a Budget Aware optimization algorithm optimizes the LLM selection to maximize expected (predicted) performance subject to budget and latency constraints; third, a token optimization module reduces token length in a quality-aware manner.
Figure 2: BertScore predictor
Figure 3: Comparison with an LLM Cascade baseline inspired by FrugalGPT. We achieve the same quality at considerably lower costs and latency (not shown here).
Figure 4: Confusion matrix for user study
Figure 5: Token optimized sentence simplification example.
...and 3 more figures

Theorems & Definitions (4)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
