Does Alignment Tuning Really Break LLMs' Internal Confidence?

Hongseok Oh; Wonseok Hwang

TL;DR

This work examines whether alignment tuning degrades LLM calibration through a study spanning four dimensions: models, calibration metrics, tasks, and confidence extraction methods. It compares three ways of extracting model confidence (continuation-sum, continuation-min, and a capital-letter choice method) in zero-shot evaluation of open LLMs, scored with the ECE and SCE metrics. While initial results show some ECE improvements after alignment, stricter analyses reveal that calibration degrades consistently when SCE is the metric, especially with choice-based confidence extraction. The findings underscore the need for careful calibration measurement and motivate algorithms that enable instruction-following without sacrificing calibration, with practical implications for deploying reliable LLMs.
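The summary above names the three confidence extraction methods without defining them. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes continuation-sum scores each answer option by the summed log-probability of its tokens, continuation-min scores it by its least likely token, and the choice method renormalizes the model's next-token log-probabilities over capital option letters. All function names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Normalize a list of log-scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def continuation_sum(token_logprobs):
    # Assumed reading: sum of the answer tokens' log-probabilities.
    return sum(token_logprobs)


def continuation_min(token_logprobs):
    # Assumed reading: log-probability of the least likely answer token.
    return min(token_logprobs)


def continuation_confidences(per_option_logprobs, score_fn):
    """Score every option with score_fn, then normalize across options."""
    return softmax([score_fn(lp) for lp in per_option_logprobs])


def choice_confidences(letter_logprobs):
    # Assumed reading: options are shown as A, B, C, ... and the log-probs
    # of those capital letters as the next token are renormalized.
    return softmax(letter_logprobs)


# Hypothetical token log-probs for a three-option question (not from the paper).
options = [[-0.2, -0.3], [-1.5, -0.1, -0.4], [-2.0]]
conf_sum = continuation_confidences(options, continuation_sum)
conf_min = continuation_confidences(options, continuation_min)
conf_choice = choice_confidences([-0.4, -1.6, -2.3])
```

In this toy input the second and third options tie under the summed score but not under the minimum score, which illustrates how the extraction method alone can change the confidence assigned to the same model output.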

Abstract

Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
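To make the notion of calibration error concrete, here is a minimal sketch of Expected Calibration Error in its common equal-width-bin form: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged with weights proportional to bin size. The bin count and function name are assumptions; the binning used in the paper is not stated here, and SCE is not covered by this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted probability of the chosen answer, one per example.
    correct: booleans, True when the chosen answer was right.
    n_bins: assumed default; not taken from the paper.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: four answers at 0.9 confidence, three of them correct.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                         [True, True, True, False])
```

A model that is right 75% of the time while reporting 90% confidence is overconfident by 0.15, which is what this metric reports for the example.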

Paper Structure (4 sections, 2 figures, 1 table)

This paper contains 4 sections, 2 figures, 1 table.

Introduction
Experiment
Results and Analysis
Discussion and Conclusion

Figures (2)

Figure 1: Expected Calibration Error (ECE) scores of pretrained or instruction-tuned LLMs (left). The ECE change rates (CR) vary significantly depending on the choice of metrics (right).
Figure 2: The rates of change in calibration metrics between instruction-tuned and pretrained models. sum, min, and choice correspond to the continuation-sum, continuation-min, and choice methods, respectively. Humanities and STEM are from MMLU.
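The captions refer to a change rate (CR) between pretrained and instruction-tuned models without giving its formula. A minimal sketch, assuming CR is the relative change of a calibration metric with the pretrained model as the baseline; this definition and the numbers below are assumptions, not values from the paper.

```python
def change_rate(pretrained_value, tuned_value):
    """Relative change of a metric after tuning (assumed definition).

    Positive values mean the calibration error grew after tuning,
    i.e. calibration got worse.
    """
    if pretrained_value == 0:
        raise ValueError("change rate is undefined for a zero baseline")
    return (tuned_value - pretrained_value) / pretrained_value


# Hypothetical scores: the error rises from 0.10 to 0.15 after tuning.
example_cr = change_rate(0.10, 0.15)
```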
