Efficiently Deploying LLMs with Controlled Risk

Michael J. Zellinger, Matt Thomson

TL;DR

Hierarchical chains with multi-level abstention (HCMA) use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls.

Abstract

Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown that costs can be cut while maintaining similar accuracy, but it has neglected risk control. By contrast, we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework offers novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled examples to achieve a low expected calibration error (ECE), cutting ECE by 50% compared to naive Platt scaling. On free-form generation tasks, we find that chain-of-thought is ineffectual for selective prediction, whereas zero-shot prompting drives error to 0% on TruthfulQA at high abstention rates. As LLMs are increasingly deployed across computing environments with different capabilities (such as mobile, laptop, and cloud), our framework paves the way towards maintaining deployment efficiency while putting sharp risk controls in place.
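
The delegation scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each model in the chain returns an answer together with a calibrated estimate of its own correctness probability, and that each level rejects below one threshold, accepts at or above another, and delegates in between. The names `Level`, `run_chain`, `reject_below`, and `accept_from`, the toy models, and all threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a model maps a query to (answer, estimated P(correct)).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Level:
    model: Model
    reject_below: float  # abstain when the estimated probability is below this
    accept_from: float   # answer when the estimated probability reaches this


def run_chain(levels: List[Level], query: str) -> Tuple[Optional[str], int]:
    """Return (answer, level index); the answer is None if the chain abstains."""
    for i, level in enumerate(levels):
        answer, p_correct = level.model(query)
        is_last = i == len(levels) - 1
        if p_correct < level.reject_below:
            return None, i  # abstain at this level
        if p_correct >= level.accept_from or is_last:
            return answer, i  # accept; the last level cannot delegate further
        # Otherwise the probability lies between the thresholds: delegate.
    raise ValueError("chain must contain at least one level")


# Toy stand-ins for a small and a large model (made-up outputs).
def small_model(query: str) -> Tuple[str, float]:
    return "A", 0.55


def large_model(query: str) -> Tuple[str, float]:
    return "B", 0.90


chain = [Level(small_model, 0.2, 0.8), Level(large_model, 0.5, 0.5)]
print(run_chain(chain, "example question"))  # prints ('B', 1)
```

In this toy run the small model is neither confident enough to answer nor uncertain enough to abstain, so the query is forwarded to the large model, which accepts it. Because only the returned answer and probability are used, the sketch is compatible with black-box API access.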

Paper Structure

This paper contains 14 sections, 2 theorems, 6 equations, 5 figures, and 1 table.

Key Result

Proposition 1

Compared to randomly assigning queries to a small language model $\mathcal{M}_\text{sm}$ and a large language model $\mathcal{M}_\text{lg}$, the reduction in error from forwarding queries based on a delegation decision $D$ is

Figures (5)

  • Figure 1: Model-intrinsic uncertainty aligns across Llama3 8B, 70B, and 405B models on MMLU, seemingly reflecting a shared sense of difficulty, but with the larger models more resistant to it. The figure shows logistic regressions fit to the transformed probabilities of the 8B model.
  • Figure 2: Illustration of a hierarchical chain with multi-level abstention (left), which consists of three models M1-M3, and the action space of an individual model in the chain (right), as a function of the estimated correctness probability.
  • Figure 3: Achievable performance profiles of an HCMA consisting of the Llama3 8B, 70B, and 405B models on MMLU. The graph shows the Pareto frontier of the most efficient HCMA configurations.
  • Figure 4: Cost-based view of the Pareto frontier of achievable HCMA performance profiles on MMLU. Each dashed line is the average performance of efficient HCMA configurations in a cost bucket, showing that HCMA outperforms the selective prediction performance of single LLMs (solid blue lines).
  • Figure 5: On TruthfulQA, using chain-of-thought prompting for verification leads to highly clustered correctness probabilities (a) that perform poorly as the abstention signal for selective prediction (c). By contrast, using zero-shot prompts leads to a unimodal distribution (b) that performs well as an abstention signal for selective prediction (d).
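
The selective-prediction curves referenced in Figures 3 to 5 relate error to abstention rate. As a rough sketch of how one point on such a curve could be computed, the Python function below abstains on the least-confident fraction of queries and reports the error on the rest. The function name `selective_error` and the toy data are hypothetical and are not results from the paper.

```python
from typing import List


def selective_error(
    p_correct: List[float], is_correct: List[bool], abstain_rate: float
) -> float:
    """Error rate on the queries kept after abstaining on the least-confident ones."""
    if not p_correct or len(p_correct) != len(is_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 <= abstain_rate < 1.0:
        raise ValueError("abstain_rate must lie in [0, 1)")
    # Rank queries from most to least confident.
    ranked = sorted(zip(p_correct, is_correct), key=lambda pair: pair[0], reverse=True)
    n_drop = int(len(ranked) * abstain_rate)  # queries the model abstains on
    kept = ranked[: len(ranked) - n_drop]
    errors = sum(1 for _, ok in kept if not ok)
    return errors / len(kept)


# Made-up example: five queries with estimated correctness probabilities.
probs = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [True, True, False, True, False]
for rate in (0.0, 0.4, 0.6):
    print(rate, selective_error(probs, labels, rate))
```

An abstention signal is useful when this error falls as the abstention rate rises. Highly clustered probabilities, as described for chain-of-thought verification in Figure 5, leave the ranking with little information to act on.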

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2