Efficiently Deploying LLMs with Controlled Risk

Michael J. Zellinger; Matt Thomson

TL;DR

The paper presents hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls.

Abstract

Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown that costs can be cut while maintaining similar accuracy, but has neglected risk control. By contrast, here we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework presents novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled examples to achieve a low expected calibration error (ECE), cutting ECE by 50% compared to naive Platt scaling. On free-form generation tasks, we find that chain-of-thought prompting is ineffective for selective prediction, whereas zero-shot prompting drives error to 0% on TruthfulQA at high abstention rates. As LLMs are increasingly deployed across computing environments with different capabilities (such as mobile, laptop, and cloud), our framework paves the way towards maintaining deployment efficiency while putting in place sharp risk controls.
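The delegation-and-abstention idea in the abstract can be illustrated with a minimal sketch. It assumes each model returns an answer together with an estimated correctness probability, and that each stage of the chain has two thresholds: below the lower one the chain abstains, at or above the upper one it returns the answer, and in between it delegates to the next, larger model. The names `Stage`, `run_chain`, `make_toy_model`, and all threshold values are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: a model maps a query to
# (answer, estimated probability that the answer is correct).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Stage:
    model: Model
    reject_below: float  # abstain if the estimated probability is below this
    accept_at: float     # return the answer if the probability is at least this


def run_chain(query: str, stages: List[Stage]) -> Tuple[Optional[str], int]:
    """Return (answer, stage index); the answer is None if the chain abstained."""
    if not stages:
        raise ValueError("stages must be non-empty")
    for i, stage in enumerate(stages):
        answer, p_correct = stage.model(query)
        is_last = i == len(stages) - 1
        if p_correct < stage.reject_below:
            return None, i            # abstain at this level
        if p_correct >= stage.accept_at or is_last:
            return answer, i          # confident enough, or nowhere left to delegate
        # otherwise fall through: delegate to the next (larger) model
    raise AssertionError("unreachable: the last stage always decides")


def make_toy_model(name: str, confidences: Dict[str, float]) -> Model:
    """Toy stand-in for an LLM API call; the confidences are made-up numbers."""
    def model(query: str) -> Tuple[str, float]:
        return f"{name}:{query}", confidences[query]
    return model


small = make_toy_model("small", {"easy": 0.95, "medium": 0.6, "hard": 0.1})
large = make_toy_model("large", {"easy": 0.99, "medium": 0.9, "hard": 0.4})
chain = [Stage(small, reject_below=0.2, accept_at=0.8),
         Stage(large, reject_below=0.5, accept_at=0.5)]

for q in ("easy", "medium", "hard"):
    print(q, run_chain(q, chain))
```

In this toy setup the small model answers the easy query itself, forwards the medium query to the large model, and abstains on the hard query without ever calling the large model, which is the kind of early abstention that saves cost.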

Paper Structure (14 sections, 2 theorems, 6 equations, 5 figures, 1 table)

Introduction
Related Work
Our Contributions
Mathematical Framework
Why Does Delegation Work?
Hierarchical Chains with Multi-Level Abstention
Analyzing System Performance: Efficiency and Risk
Making Model-Intrinsic Uncertainty Work For HCMAs
Results
Transforming Raw Probabilities Significantly Enhances Platt Scaling
HCMA Provides Novel Risk & Efficiency Trade-Offs
Early Abstention Makes Lowest Risk Cheaper
Chain-of-Thought Prompting May Impede Selective Prediction
Conclusion

Key Result

Proposition 1

Compared to randomly assigning queries between a small language model $\mathcal{M}_\text{sm}$ and a large language model $\mathcal{M}_\text{lg}$, the reduction in error from forwarding queries based on a delegation decision $D$ is

Figures (5)

Figure 1: Model-intrinsic uncertainty aligns across Llama3 8B, 70B, and 405B models on MMLU, seemingly reflecting a shared sense of difficulty, but with the larger models more resistant to it. The figure shows logistic regressions fit to the transformed probabilities of the 8B model.
Figure 2: Illustration of a hierarchical chain with multi-level abstention (left), which consists of three models M1-M3, and the action space of an individual model in the chain (right), as a function of the estimated correctness probability.
Figure 3: Achievable performance profiles of an HCMA consisting of the Llama3 8B, 70B, and 405B models on MMLU. The graph shows the Pareto frontier of the most efficient HCMA configurations.
Figure 4: Cost-based view of the Pareto frontier of achievable HCMA performance profiles on MMLU. Each dashed line is the average performance of efficient HCMA configurations in a cost bucket, showing that HCMA outperforms the selective prediction performance of single LLMs (solid blue lines).
Figure 5: On TruthfulQA, using chain-of-thought prompting for verification leads to highly clustered correctness probabilities (a) that perform poorly as the abstention signal for selective prediction (c). By contrast, using zero-shot prompts leads to a unimodal distribution (b) that performs well as an abstention signal for selective prediction (d).
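The selective prediction performance referred to in these captions can be made concrete with a small sketch: abstain on the least-confident fraction of queries and measure the error rate on the rest. The function name `selective_error` and the toy data are assumptions for illustration, not the paper's evaluation code.

```python
from typing import List


def selective_error(p_correct: List[float],
                    is_correct: List[bool],
                    abstention_rate: float) -> float:
    """Error rate on answered queries after abstaining on the
    least-confident `abstention_rate` fraction of all queries."""
    if len(p_correct) != len(is_correct) or not p_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 <= abstention_rate <= 1.0:
        raise ValueError("abstention_rate must lie in [0, 1]")
    n = len(p_correct)
    n_abstain = int(abstention_rate * n)
    # Sort query indices from least to most confident, then drop the first ones.
    order = sorted(range(n), key=lambda i: p_correct[i])
    answered = order[n_abstain:]
    if not answered:
        return 0.0  # nothing answered, so no errors were made
    errors = sum(1 for i in answered if not is_correct[i])
    return errors / len(answered)


# Made-up confidences and labels: the two wrong answers have the lowest confidence.
confidences = [0.1, 0.3, 0.5, 0.7, 0.9]
labels = [False, False, True, True, True]

for rate in (0.0, 0.2, 0.4):
    print(rate, selective_error(confidences, labels, rate))
```

When the confidence signal ranks wrong answers below right ones, as in this toy data, the error rate falls as the abstention rate rises. A tightly clustered signal, like the one the Figure 5 caption describes for chain-of-thought verification, gives the sort little to work with.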

Theorems & Definitions (2)

Proposition 1
Proposition 2
