Does Alignment Tuning Really Break LLMs' Internal Confidence?
Hongseok Oh, Wonseok Hwang
TL;DR
This work examines whether alignment tuning degrades LLM calibration through a multi-dimensional study across models, calibration metrics, tasks, and confidence extraction methods. It compares three ways of extracting model confidence (continuation-sum, continuation-min, and a capital-letter choice method) and evaluates open LLMs zero-shot with the $ECE$ and $SCE$ metrics. Initial results show that alignment sometimes improves $ECE$, but stricter analysis reveals that calibration degrades consistently when measured with $SCE$, especially with choice-based confidence extraction. The findings underscore the need for careful calibration measurement and motivate algorithms that enable instruction following without sacrificing calibration, which matters for deploying reliable LLMs.
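To make the difference between the two metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes that $ECE$ is the usual expected calibration error over the top predicted label, that $SCE$ denotes the static (classwise) calibration error averaged over all answer options, and that both use equal-width bins. The function names `ece` and `sce`, the helper `_bin_index`, and the default of 10 bins are illustrative choices.

```python
def _bin_index(p, n_bins):
    # Equal-width bins over [0, 1]; p == 1.0 is placed in the last bin.
    return min(int(p * n_bins), n_bins - 1)


def ece(probs, labels, n_bins=10):
    """Expected calibration error on the top-1 prediction (assumed definition).

    probs:  list of per-example probability lists, one entry per answer option
    labels: list of gold option indices
    """
    n = len(labels)
    # Per bin: [summed confidence, number of correct predictions]
    bins = [[0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        b = bins[_bin_index(p[pred], n_bins)]
        b[0] += p[pred]
        b[1] += float(pred == y)
    # (n_b / n) * |acc_b - conf_b| simplifies to |correct_b - conf_sum_b| / n
    return sum(abs(conf_sum - correct) for conf_sum, correct in bins) / n


def sce(probs, labels, n_bins=10):
    """Static (classwise) calibration error (assumed definition).

    Unlike ECE, every option's probability is checked, not only the top one.
    """
    n = len(labels)
    n_classes = len(probs[0])
    total = 0.0
    for c in range(n_classes):
        # Per bin: [summed probability of option c, count of examples with gold == c]
        bins = [[0.0, 0.0] for _ in range(n_bins)]
        for p, y in zip(probs, labels):
            b = bins[_bin_index(p[c], n_bins)]
            b[0] += p[c]
            b[1] += float(y == c)
        total += sum(abs(conf_sum - hits) for conf_sum, hits in bins) / n
    return total / n_classes


if __name__ == "__main__":
    # Hypothetical probabilities over four options (A-D) for three questions.
    probs = [
        [0.70, 0.10, 0.10, 0.10],
        [0.25, 0.40, 0.20, 0.15],
        [0.05, 0.05, 0.10, 0.80],
    ]
    labels = [0, 2, 3]
    print("ECE:", ece(probs, labels))
    print("SCE:", sce(probs, labels))
```

Under these assumptions, $SCE$ penalizes miscalibrated probability mass on non-predicted options as well, which is one plausible reason the two metrics can disagree about whether alignment helps or hurts.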
Abstract
Large Language Models (LLMs) have shown remarkable progress, but their real-world application requires reliable calibration. This study conducts a comprehensive analysis of calibration degradation in LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found that the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidence and calibration error and (2) future research into algorithms that help LLMs achieve both instruction following and calibration without sacrificing either.
