SteerConf: Steering LLMs for Confidence Elicitation

Ziang Zhou, Tianyuan Jin, Jieming Shi, Qing Li

TL;DR

SteerConf addresses overconfidence in LLMs by steering verbalized confidence through a prompt-based framework that collects responses across multiple steering levels, quantifies confidence consistency, and calibrates final confidence without model training. The method integrates steering prompts, steered confidence consistency, and calibrated selection to improve both confidence calibration and failure prediction, demonstrated across seven benchmarks and three state-of-the-art models. Results show substantial gains in calibration (lower ECE) and failure detection (higher AUROC and PR metrics) compared with vanilla prompts and baselines, including notable improvements under Chain-of-Thought prompting. This work offers a practical, black-box strategy to enhance the reliability of LLMs in high-stakes settings, supporting safer real-world deployments while avoiding internal model access or fine-tuning.
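Since calibration gains are reported as lower ECE, a minimal sketch of how expected calibration error is commonly computed may help. It uses the standard equal-width binning variant; the function name, the bin count of 10, and the input layout are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: 10 equal-width bins, a common default
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin by its confidence in [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

An overconfident model that states 0.9 confidence but is right half the time gets an ECE of about 0.4 under this sketch, while a model whose stated confidence matches its accuracy scores 0.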

Abstract

Large Language Models (LLMs) exhibit impressive performance across diverse domains but often suffer from overconfidence, limiting their reliability in critical applications. We propose SteerConf, a novel framework that systematically steers LLMs' confidence scores to improve their calibration and reliability. SteerConf introduces three key components: (1) a steering prompt strategy that guides LLMs to produce confidence scores in specified directions (e.g., conservative or optimistic) by leveraging prompts with varying steering levels; (2) a steered confidence consistency measure that quantifies alignment across multiple steered confidences to enhance calibration; and (3) a steered confidence calibration method that aggregates confidence scores using consistency measures and applies linear quantization for answer selection. SteerConf operates without additional training or fine-tuning, making it broadly applicable to existing LLMs. Experiments on seven benchmarks spanning professional knowledge, common sense, ethics, and reasoning tasks, using advanced LLMs (GPT-3.5, LLaMA 3, GPT-4), demonstrate that SteerConf outperforms existing methods, often by a significant margin. Our findings highlight the potential of steering the confidence of LLMs to enhance their reliability for safer deployment in real-world applications.
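The steering prompt strategy in component (1) can be pictured with a small sketch. Everything concrete here is an assumption: the instruction wording, the names of the three intermediate levels, the requested output format, and the helper names `build_steering_prompts` and `parse_response` are hypothetical and are not the prompts used in the paper.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical steering instructions, ordered from most conservative to most optimistic.
# The wording is illustrative only; the paper's actual prompts are not reproduced here.
STEERING_LEVELS: Dict[str, str] = {
    "very_cautious": "Be very cautious: report low confidence unless you are certain.",
    "cautious": "Be cautious when stating your confidence.",
    "vanilla": "State your confidence.",
    "confident": "Be confident when stating your confidence.",
    "very_confident": "Be very confident: report high confidence unless you see a clear flaw.",
}

# Assumed output format that each prompt asks the model to follow.
OUTPUT_FORMAT = "Reply as 'Answer: <answer>, Confidence: <0-100>%'."


def build_steering_prompts(question: str) -> Dict[str, str]:
    """Return one prompt per steering level for the same question."""
    return {
        level: f"{instruction}\n{OUTPUT_FORMAT}\nQuestion: {question}"
        for level, instruction in STEERING_LEVELS.items()
    }


def parse_response(text: str) -> Optional[Tuple[str, float]]:
    """Extract (answer, confidence in [0, 1]) from a reply, or None if it does not match."""
    match = re.search(
        r"Answer:\s*(.+?)\s*[,;\n]\s*Confidence:\s*(\d{1,3})\s*%",
        text,
        flags=re.IGNORECASE,
    )
    if match is None:
        return None
    answer = match.group(1)
    confidence = min(int(match.group(2)) / 100.0, 1.0)  # clamp values above 100%
    return answer, confidence
```

Each of the five prompts would be sent to the same black-box model, and the parsed (answer, confidence) pairs then feed the consistency and calibration steps.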

Paper Structure

This paper contains 19 sections, 10 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Steered confidence under the vanilla and very cautious prompts on Object Counting data with GPT-3.5.
  • Figure 2: Overview of the SteerConf framework, comprising three components: Steering Prompting, Steered Confidence Consistency, and Steered Confidence Calibration. Five steering levels, from very cautious to very confident, elicit verbalized confidences and answers from the LLM. Confidence consistency $\kappa_{conf}$ evaluates the stability of these confidences. The calibrated confidence $c(x)$ is computed by combining the mean confidence $\mu_c$, confidence consistency $\kappa_{conf}$, and answer consistency $\kappa_{ans}$, and the final answer $f(x)$ is selected based on $c(x)$.
  • Figure 3: Confidence histograms of SteerConf and baselines for LLaMA3 with CoT. The rows are different datasets, and the columns are different methods.
  • Figure 4: Confidence histograms of different steering levels for LLaMA3 with CoT on StrategyQA and Law datasets. The columns are different steering levels.
  • Figure 5: Confidence histograms for LLaMA3 with CoT on DateUnd, GSM8K and Ethics datasets. The columns are different steering levels.
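The Figure 2 caption describes combining the mean confidence $\mu_c$, the confidence consistency $\kappa_{conf}$, and the answer consistency $\kappa_{ans}$ into a calibrated confidence $c(x)$. The sketch below shows one way such a combination could look. The specific formulas are assumptions chosen for illustration, since the paper's equations are not given in this overview: $\kappa_{conf}$ is taken as $1 / (1 + \sigma_c / \mu_c)$, $\kappa_{ans}$ as the share of the most frequent answer, and $c(x)$ as their product with $\mu_c$. The final answer is simplified to the most frequent answer, so the paper's linear quantization step is not reproduced.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def aggregate_steered(responses: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Combine (answer, confidence) pairs from all steering levels.

    Returns (selected answer, calibrated confidence). All formulas below are
    illustrative assumptions, not the definitions used in the paper.
    """
    if not responses:
        raise ValueError("at least one steered response is required")

    answers = [answer for answer, _ in responses]
    confidences = [conf for _, conf in responses]

    mu_c = mean(confidences)        # mean steered confidence
    sigma_c = pstdev(confidences)   # spread of confidences across steering levels

    # Assumed confidence consistency: 1 when all levels agree, smaller as spread grows.
    kappa_conf = 1.0 / (1.0 + sigma_c / mu_c) if mu_c > 0 else 0.0

    # Assumed answer consistency: share of the most frequent answer.
    top_answer, top_count = Counter(answers).most_common(1)[0]
    kappa_ans = top_count / len(answers)

    # Assumed combination: discount the mean confidence by both consistencies.
    calibrated = mu_c * kappa_conf * kappa_ans
    return top_answer, calibrated
```

Under these assumptions, five identical replies of ("A", 0.8) yield a calibrated confidence of 0.8, while disagreement in either the answers or the stated confidences pulls the result below the plain mean.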