Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection

Guillem Ramírez; Alexandra Birch; Ivan Titov

TL;DR

The paper addresses budget-constrained use of a small and a large LLM by proposing Margin Sampling, a simple uncertainty-based cascade that uses the margin of the small model's first-token distribution to decide when to call the larger model. It requires neither training an auxiliary meta-model nor calling the small model repeatedly, and it achieves better cost-accuracy trade-offs across nine tasks and three LLM pairs. In both single-task and multi-task settings, Margin Sampling outperforms routing, cascading, and meta-model approaches, and it is robust to variations in data and cost. The findings support using signals intrinsic to LLM generations for efficient inference and encourage broader adoption of simple, uncertainty-driven strategies.
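The decision rule summarised above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the margin is the gap between the two most probable candidates for the small model's first generated token, and the function names, the raw-logit input, and the `threshold` parameter are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_token_margin(first_token_logits):
    """Gap between the two most probable first tokens (assumed margin definition)."""
    probs = sorted(softmax(first_token_logits), reverse=True)
    return probs[0] - probs[1]


def should_call_large(first_token_logits, threshold):
    """Escalate to the large model when the small model's margin is below the threshold."""
    return first_token_margin(first_token_logits) < threshold


# A near-tie between candidates gives a small margin, so the query is escalated.
uncertain = [2.0, 1.9, -1.0]
# A dominant candidate gives a large margin, so the small model's answer is kept.
confident = [8.0, 0.5, -1.0]

print(should_call_large(uncertain, threshold=0.3))
print(should_call_large(confident, threshold=0.3))
```

In such a sketch, raising the threshold sends more queries to the large model, so it acts as the knob that trades cost against accuracy.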

Abstract

Researchers and practitioners operating on a limited budget face the cost-performance trade-off dilemma. The challenging decision often centers on whether to use a large LLM with better performance or a smaller one with reduced costs. This has motivated recent research in the optimisation of LLM calls. Either a cascading strategy is used, where a smaller LLM or both are called sequentially, or a routing strategy is used, where only one model is ever called. Both scenarios are dependent on a decision criterion which is typically implemented by an extra neural model. In this work, we propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion. We compare our approach with both cascading and routing strategies using three different pairs of pre-trained small and large LLMs, on nine different tasks and against approaches that require an additional neural model. Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
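The cost difference between the two strategies described in the abstract can be made concrete with a small accounting sketch. It rests on simplifying assumptions that are not taken from the paper: each model has a fixed per-query cost (`c_small`, `c_large`, both hypothetical), and the cost of any auxiliary decision model is ignored.

```python
def routing_cost(use_large, c_small, c_large):
    """Routing: exactly one model is called per query."""
    return sum(c_large if u else c_small for u in use_large)


def cascading_cost(escalate, c_small, c_large):
    """Cascading: the small model is always called; the large one only on escalation."""
    return sum(c_small + (c_large if e else 0.0) for e in escalate)


# Same four decisions under both strategies (True = large model is used).
decisions = [True, False, False, True]
print(routing_cost(decisions, c_small=1.0, c_large=10.0))
print(cascading_cost(decisions, c_small=1.0, c_large=10.0))
```

Under these assumptions a cascade pays for the small model on every query, so for identical decisions it costs slightly more than routing; in exchange, its decision can use the small model's actual output rather than the query alone.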

Paper Structure (42 sections, 3 equations, 3 figures, 8 tables)

Introduction
Related work
LLM uncertainty
Optimisation of inference costs
Optimisation of LLM API Calls
Optimisation of LLM calls
Problem definition
LLM Calling Strategies
Routing Strategies
Random routing
Routing [sakotaroutingexpert]
HybridLLM [hybrid]
Cascading
FrugalGPT [frugalgpt]
Margin Sampling (ours)
...and 27 more sections

Figures (3)

Figure 1: Routing (left) attempts to select the LLM with the best cost-accuracy trade-off given an incoming query. In cascading (right), all queries are passed through the small model, and depending on its output, the large LLM is consulted. We propose using a cascading approach that uses the margin of the generations to score outputs from the small LLM.
Figure 2: Accuracy curve with respect to budgets. We have averaged results for all the tasks.
Figure 3: Accuracy curve with respect to budgets, in the multi-task setting.
