Cost-Saving LLM Cascades with Early Abstention

Michael J. Zellinger, Rex Liu, Matt Thomson

TL;DR

The paper investigates whether, in risk-sensitive settings, allowing small LLMs in a cascade to abstain early can reduce reliance on expensive large models without increasing errors. It extends a continuous threshold-tuning framework to a multi-objective setting with abstention, and models cross-model confidence via a Markov factorization to optimize thresholds across user preferences. Across six benchmarks, early abstention reduces cascade loss by $2.2\%$ on average, driven by a $13.0\%$ reduction in cost and a $5.0\%$ reduction in error rate, in exchange for a $4.1\%$ average increase in the overall abstention rate. Gains are strongest when cost is a priority. The work highlights the value of leveraging correlations between model error patterns to design more reliable and cost-efficient LLM cascades for risk-sensitive applications.

Abstract

LLM cascades deploy small LLMs to answer most queries, limiting the use of large and expensive LLMs to difficult queries. This approach can significantly reduce costs without impacting performance. However, risk-sensitive domains such as finance or medicine place an additional premium on avoiding model errors. Since even the most expensive models are susceptible to making mistakes, applications in these domains benefit from allowing LLM systems to completely abstain from answering difficult queries. Introducing abstention poses a design question for LLM cascades: should abstention only be allowed at the final model or also at earlier models? Since the error patterns of small and large models are correlated, allowing earlier models to abstain may reduce inference costs and latency by anticipating abstention decisions by expensive and slow models, thus avoiding the need to run these models. We investigate the benefits of such "early abstention" in LLM cascades and find that it reduces overall test loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA, TruthfulQA, and XSum). These gains result from a more effective use of abstention, trading a 4.1% average increase in the overall abstention rate for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings demonstrate the possibility of leveraging correlations between the error patterns of different language models to drive performance improvements for LLM systems with abstention.
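The design question raised in the abstract can be illustrated with a minimal sketch of a two-model cascade. The three-threshold decision rule, the `Thresholds` class, and the `run_cascade` function below are illustrative assumptions, not the authors' implementation; confidences are assumed to lie in $[0, 1]$, and the large model is passed as a callable so that it only runs when the small model defers.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Thresholds:
    """Hypothetical thresholds for a two-model cascade (assumed names)."""
    early_abstain: float  # small model abstains outright below this confidence
    defer: float          # small model answers at or above this confidence
    final_abstain: float  # large model abstains below this confidence


def run_cascade(
    small_conf: float,
    large_conf_fn: Callable[[], float],
    th: Thresholds,
) -> Tuple[str, int]:
    """Return (decision, number_of_models_run) for a single query.

    Assumes th.early_abstain <= th.defer. Setting early_abstain to 0.0
    disables early abstention, so only the final model can abstain.
    """
    if small_conf < th.early_abstain:
        # Early abstention: the expensive model is never called.
        return ("abstain", 1)
    if small_conf >= th.defer:
        # The small model is confident enough to answer on its own.
        return ("answer_small", 1)
    # Otherwise defer: only now is the large model's cost incurred.
    large_conf = large_conf_fn()
    if large_conf < th.final_abstain:
        return ("abstain", 2)
    return ("answer_large", 2)
```

With `early_abstain` set to `0.0`, every low-confidence query is forwarded to the large model, which is the "abstain only at the final model" baseline; raising it trades additional abstentions for fewer large-model calls.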

Paper Structure

This paper contains 7 sections, 10 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Our framework evaluates cascade performance across the range of user preferences by taking into account the user's desire to reduce cost ($\lambda_{c}$) and abstention ($\lambda_{a}$), relative to the need for avoiding errors.
  • Figure 2: Precision-recall curves for early abstention, that is, predicting a large model's decision to abstain using small models' confidence scores; the percentage shown is the large model's abstention rate. Performance visibly exceeds the random baseline (dashed gray line, equal to the large model's abstention rate). At lower recall, smaller models can often predict final-model abstentions with high precision.
  • Figure 3: Percentage change in overall cascade loss when allowing early abstention vs only abstaining at the final model, for the Llama3.2 1B $\rightarrow$ Llama3.1 405B cascade. The performance benefits of early abstention concentrate in the top right corner of the user preference space (blue), where the user's sensitivity to abstention ($\lambda_a$) is low to moderate and the sensitivity to cost ($\lambda_c$) is moderate to high.

Theorems & Definitions (2)

  • Definition: Early Abstention
  • Definition: Cascade Loss
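
The cascade loss definition itself is not reproduced in this listing. One plausible form, consistent with the Figure 1 caption (cost weighted by $\lambda_c$ and abstention weighted by $\lambda_a$, both relative to errors), is sketched below. The additive form, the function name `cascade_loss`, and the example numbers are assumptions for illustration only, not results from the paper.

```python
def cascade_loss(
    error_rate: float,
    expected_cost: float,
    abstention_rate: float,
    lambda_c: float,
    lambda_a: float,
) -> float:
    """Assumed additive cascade loss: errors have unit weight, while
    lambda_c and lambda_a express the user's sensitivity to cost and
    to abstention relative to errors."""
    return error_rate + lambda_c * expected_cost + lambda_a * abstention_rate


# Hypothetical numbers, chosen only to show the trade-off: the second
# configuration abstains more often but has lower cost and error rate.
final_only = cascade_loss(0.10, 2.0, 0.05, lambda_c=0.05, lambda_a=0.2)
with_early = cascade_loss(0.09, 1.5, 0.08, lambda_c=0.05, lambda_a=0.2)
```

Under this assumed form, a higher abstention rate lowers the loss only when the savings in cost and error outweigh the $\lambda_a$-weighted abstention penalty, which matches the caption's observation that benefits concentrate where $\lambda_a$ is low to moderate and $\lambda_c$ is moderate to high.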