Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models

Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman

TL;DR

This work identifies confidence mis-calibration in large language models for many-choice MCQA, where correct answers become under-confident and incorrect ones over-confident as the choice count grows. It introduces Self-Ensemble, a plug-in mechanism that divides the answer space into smaller groups and aggregates group-level predictions using a designed attention mask and positional re-encoding, enabling intrinsic multi-group inference without labeled tuning data. Empirically, Self-Ensemble improves accuracy across multiple models and datasets, enhances calibration by raising confidence on correct answers and lowering it on incorrect ones, and exhibits favorable scaling with model size, including applicability to quantized LLMs. The method is model-agnostic, does not require retraining, and can complement existing calibration or prompting strategies, offering a practical route to more reliable MCQA with LLMs.

Abstract

Although Large Language Models (LLMs) perform well across general domains, they exhibit a confidence distortion problem on multiple-choice question answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs are under-confident in correct predictions and over-confident in incorrect ones, which substantially degrades performance. To address this problem, we propose Self-ensemble. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. Self-ensemble is plug-and-play: it can be integrated into existing LLM architectures through a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and three datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.
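The group-then-ensemble idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the aggregation rule used here (each group's top choice advances to a final round) is an assumption, since the abstract does not specify how group-level predictions are combined. The names `split_into_groups`, `predict_grouped`, and `toy_score` are hypothetical, and `toy_score` is a word-overlap stand-in for an LLM's per-choice logits.

```python
import math
from typing import Callable, List, Sequence


def split_into_groups(choices: Sequence[str], group_size: int) -> List[List[str]]:
    """Partition the answer choices into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [list(choices[i:i + group_size]) for i in range(0, len(choices), group_size)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_grouped(
    question: str,
    choices: Sequence[str],
    group_size: int,
    score_fn: Callable[[str, Sequence[str]], List[float]],
) -> str:
    """Score each small group separately, then decide among the group winners.

    ASSUMPTION: the final-round aggregation is illustrative only; the source
    does not state the exact ensembling rule.
    """
    groups = split_into_groups(choices, group_size)
    winners = []
    for group in groups:
        probs = softmax(score_fn(question, group))
        winners.append(group[probs.index(max(probs))])
    if len(winners) == 1:
        return winners[0]
    final_probs = softmax(score_fn(question, winners))
    return winners[final_probs.index(max(final_probs))]


def toy_score(question: str, group: Sequence[str]) -> List[float]:
    """Hypothetical scorer: number of words a choice shares with the question."""
    q_words = set(question.lower().split())
    return [float(len(q_words & set(c.lower().split()))) for c in group]


question = "which gas do plants absorb"
choices = [
    "oxygen is a gas", "plants absorb carbon dioxide gas", "nitrogen", "helium",
    "argon", "neon", "water vapor", "methane gas",
]
answer = predict_grouped(question, choices, group_size=4, score_fn=toy_score)
```

With eight choices and `group_size=4`, the scorer only ever sees four options at a time, which is the setting where the abstract reports confidence to be better behaved. In practice `score_fn` would wrap a model call that returns one logit per option.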

Paper Structure

This paper contains 46 sections, 6 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Self-Ensemble's comprehensive performance on the QASC, TruthfulQA, and MMLU-Pro Biology datasets compared with baseline methods.
  • Figure 2: Proportion of model prediction probability exceeding a threshold on the QASC dataset, for each model under both correct- and incorrect-answer conditions.
  • Figure 3: LLMs ignore the correct choice and pick the incorrect one in the many-choice setting.
  • Figure 4: Example of Self-Ensemble process on 4-choice QA.
  • Figure 5: Plug-in Self-Ensemble: by incorporating the attention mask and positional re-encoding, LLMs can achieve the ensembled results in a single forward pass.
  • ...and 2 more figures
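The caption of Figure 5 mentions an attention mask and positional re-encoding that let a single forward pass produce the ensembled result. The sketch below shows one plausible way such a layout could be built; it is an assumption, not the paper's construction. Here the question tokens form a shared prefix, each choice group may attend only to the question and to itself, and position ids restart right after the question for every group. The function name `build_group_mask` and the token counts are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_group_mask(
    q_len: int, group_lens: Sequence[int]
) -> Tuple[List[List[bool]], List[int]]:
    """Build a causal, group-isolating attention mask plus restarted position ids.

    ASSUMPTION: segment 0 is the shared question prefix; segments 1..G are the
    choice groups, laid out one after another in a single sequence.
    mask[i][j] is True when token i is allowed to attend to token j.
    """
    segment = [0] * q_len
    position_ids = list(range(q_len))
    for g, n in enumerate(group_lens, start=1):
        segment.extend([g] * n)
        # Every group is re-encoded as if it directly followed the question.
        position_ids.extend(range(q_len, q_len + n))

    total = len(segment)
    mask = [
        [
            j <= i and (segment[j] == 0 or segment[j] == segment[i])
            for j in range(total)
        ]
        for i in range(total)
    ]
    return mask, position_ids


# Two question tokens, then one group of two tokens and one group of one token.
mask, position_ids = build_group_mask(q_len=2, group_lens=[2, 1])
```

Under this layout, the last token (second group) sees the question but not the first group, so each group is processed as if it were its own short prompt while the question is encoded only once.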