Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles

Yuxi Xia, Pedro Henrique Luz de Araujo, Klim Zaporojets, Benjamin Roth

TL;DR

This work tackles LLM calibration across diverse prompt styles and model sizes by introducing Calib-n, an auxiliary-model framework that aggregates responses from multiple LLMs to estimate per-answer confidence. The auxiliary model is trained with binary cross-entropy (BCE), focal loss (FL), and AUC surrogate losses, and is evaluated on four open-ended QA datasets with 12 LLMs. The results show that inter-model response agreement and focal loss substantially improve calibration, that few-shot prompts yield additional gains, and that auxiliary models maintain robust calibration under accuracy variations. The findings highlight the value of considering prompt style, cross-model agreement, and tailored loss functions for reliable confidence estimation in diverse real-world deployments.

Abstract

Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure how well their methods generalize to other prompt styles and to LLMs of different sizes. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate whether incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation; the auxiliary model aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration over baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and that auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs' internal probabilities and verbalized confidences. These insights deepen the understanding of the factors that influence LLM calibration, supporting the reliable deployment of LLMs in diverse applications.
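As a rough illustration of the losses named in the abstract, the sketch below compares binary cross-entropy with the standard focal loss for a single predicted confidence. It is not taken from the paper: the function names, the default `gamma=2.0`, and the clipping constant `eps` are assumptions made here for illustration, and the AUC surrogate loss is omitted.

```python
import math


def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one confidence p and a 0/1 correctness label y."""
    # p_t is the probability assigned to the true outcome.
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, eps))


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Standard focal loss: BCE scaled by (1 - p_t) ** gamma.

    gamma=2.0 is an illustrative default, not a value taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    # The modulating factor shrinks the loss on examples that are already
    # predicted confidently and correctly.
    return -((1.0 - p_t) ** gamma) * math.log(max(p_t, eps))


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        print(p, round(bce_loss(p, 1), 4), round(focal_loss(p, 1), 4))
```

With `gamma=0` the focal loss reduces to binary cross-entropy; larger values of `gamma` down-weight easy examples more strongly.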

Paper Structure (31 sections, 6 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Overview of the calibration training of Calib-n ($n$ indicates the number of LLMs that provide responses). To study the effect of prompt style on calibration, we design four diverse prompts and use them to query $n$ target LLMs for answers to a question. The question is joined with each answer, giving $n$ joint strings that together represent the response agreement of the LLMs; these strings are passed to the auxiliary model, which generates a probability for each answer. The auxiliary model is optimized on the correctness of the LLM answers, using each of three loss functions separately. The Brier score (one of four metrics) is used to evaluate the calibration performance of the auxiliary model.
  • Figure 2: Comparison of Calib-1 and Calib-n methods based on the number of wins across different loss functions for calibrating small (left) and large (right) LLMs.
  • Figure 3: Win comparisons of different methods and prompts. Sub-figures \ref{fig:main_bce} and \ref{fig:main_fl} show that Calib-* methods using BCE and FL loss, respectively, outperform the baselines. \ref{fig:main_all} compares all Calib-* methods and shows that (FL)Calib-1 achieves the best overall performance. \ref{fig:main_prompt} compares wins across all prompt styles and shows that few-shot prompts are the most beneficial.
  • Figure 4: The correlation between accuracies achieved by different configurations (i.e., prompts, models, datasets) and corresponding ECE scores evaluated on different methods. The line of Verbalized % is not continuous because it only applies to Verb. prompts and thus has fewer accuracy points than other methods. The result indicates that Calib-* and APRICOT are robust to accuracy variations. Different methods achieve the lowest ECE scores in different accuracy ranges.
  • Figure 5: Reliability diagrams for our different methods, using 10 bins each, for Llama3.1-70b on NQ. The color and the percentage within each bar indicate the proportion of all data samples contained in that bin. More figures for other models and datasets are shown in Appendix \ref{other_diagrams}.
  • ...and 6 more figures
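
The Brier score and ECE mentioned in the captions above can be sketched as follows. This is a minimal illustration using the common textbook definitions, not the paper's evaluation code: the function names are hypothetical, and the 10 equal-width bins are an assumption chosen to match the bin count of the reliability diagrams in Figure 5.

```python
def brier_score(confidences, labels):
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE with equal-width confidence bins (n_bins=10 is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    total = len(labels)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    confs = [0.9, 0.9, 0.1, 0.1]
    correct = [1, 1, 0, 0]
    print(brier_score(confs, correct))
    print(expected_calibration_error(confs, correct))
```

Lower values are better for both metrics. The per-bin average confidence and accuracy computed inside the loop are the same quantities a reliability diagram plots.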