Cost-Saving LLM Cascades with Early Abstention

Michael J. Zellinger; Rex Liu; Matt Thomson

TL;DR

In risk-sensitive settings, the paper investigates whether allowing the small LLMs in a cascade to abstain early can reduce reliance on expensive large models without increasing errors. It extends a continuous threshold-tuning framework to a multi-objective setting with abstention, and models cross-model confidence via a Markov factorization so that thresholds can be optimized across user preferences. Early abstention reduces cascade loss by 2.2% on average across six benchmarks, trading a 4.1% average increase in the abstention rate for a 13.0% reduction in cost and a 5.0% reduction in error rate, with strong gains when cost is a priority. The work highlights the value of leveraging correlations between the error patterns of different models to design more reliable and cost-efficient LLM cascades for risk-sensitive applications.

Abstract

LLM cascades deploy small LLMs to answer most queries, limiting the use of large and expensive LLMs to difficult queries. This approach can significantly reduce costs without impacting performance. However, risk-sensitive domains such as finance or medicine place an additional premium on avoiding model errors. Since even the most expensive models are susceptible to making mistakes, applications in these domains benefit from allowing LLM systems to completely abstain from answering difficult queries. Introducing abstention poses a design question for LLM cascades: should abstention only be allowed at the final model or also at earlier models? Since the error patterns of small and large models are correlated, allowing earlier models to abstain may reduce inference costs and latency by anticipating abstention decisions by expensive and slow models, thus avoiding the need to run these models. We investigate the benefits of such "early abstention" in LLM cascades and find that it reduces overall test loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA, TruthfulQA, and XSum). These gains result from a more effective use of abstention, trading a 4.1% average increase in the overall abstention rate for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings demonstrate the possibility of leveraging correlations between the error patterns of different language models to drive performance improvements for LLM systems with abstention.
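To make the design question concrete, the following is a minimal sketch of a cascade in which every model, not only the last, may abstain. It is an illustration under stated assumptions and not the authors' implementation: the names `Stage`, `run_cascade`, `accept_threshold`, and `abstain_threshold`, the toy confidence values, and the per-call costs are all hypothetical, and the paper's procedure for tuning the thresholds is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    """One model in the cascade (hypothetical structure for illustration)."""
    name: str
    cost: float                                  # assumed cost of one call
    answer: Callable[[str], Tuple[str, float]]   # returns (answer, confidence in [0, 1])
    accept_threshold: float                      # confidence >= this: return the answer
    abstain_threshold: float                     # confidence < this: abstain at this stage


def run_cascade(query: str, stages: List[Stage]) -> Tuple[Optional[str], float, str]:
    """Return (answer or None if abstained, total cost, name of the deciding stage)."""
    total_cost = 0.0
    for i, stage in enumerate(stages):
        total_cost += stage.cost
        answer, confidence = stage.answer(query)
        is_last = i == len(stages) - 1
        if confidence >= stage.accept_threshold:
            return answer, total_cost, stage.name
        # Early abstention: a low-confidence small model stops the cascade
        # instead of paying for the larger models. The last model abstains
        # whenever it does not accept, since there is nothing to defer to.
        if confidence < stage.abstain_threshold or is_last:
            return None, total_cost, stage.name
        # Otherwise defer the query to the next, more expensive model.
    raise ValueError("cascade needs at least one stage")


# Toy models with fixed, made-up confidences per query.
SMALL_CONF = {"easy": 0.9, "hard": 0.5, "hopeless": 0.1}
LARGE_CONF = {"easy": 0.99, "hard": 0.95, "hopeless": 0.3}

small = Stage("small", 1.0, lambda q: ("small-answer", SMALL_CONF[q]), 0.8, 0.2)
large = Stage("large", 10.0, lambda q: ("large-answer", LARGE_CONF[q]), 0.7, 0.0)

easy_result = run_cascade("easy", [small, large])
hard_result = run_cascade("hard", [small, large])
hopeless_result = run_cascade("hopeless", [small, large])
```

In this toy setup the "hopeless" query is rejected by the small model at a cost of 1.0. Setting the small model's `abstain_threshold` to 0.0 recovers a cascade that abstains only at the final model: the same query is then deferred, the large model also declines, and the abstention costs 11.0 instead.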
