Calibrate to Discriminate: Improve In-Context Learning with Label-Free Comparative Inference

Wei Cheng, Tianlu Wang, Yanmin Ji, Fan Yang, Keren Tan, Yiyu Zheng

TL;DR

A novel in-context comparative inference method is developed to alleviate miscalibration and improve classification performance; it achieves more accurate and better-calibrated predictions than regular zero-shot and few-shot prompting.

Abstract

While in-context learning with large language models (LLMs) has shown impressive performance, we have discovered a unique miscalibration behavior in which both correct and incorrect predictions are assigned the same level of confidence. We refer to this phenomenon as indiscriminate miscalibration. We found that traditional calibration metrics, such as Expected Calibration Error (ECE), are unable to capture this behavior effectively. To address this issue, we propose new metrics to measure the severity of indiscriminate miscalibration. Additionally, we develop a novel in-context comparative inference method to alleviate miscalibration and improve classification performance. Through extensive experiments on five datasets, we demonstrate that our proposed method achieves more accurate and better-calibrated predictions than regular zero-shot and few-shot prompting.
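To make the claim about ECE concrete, the following is a minimal sketch, not the authors' code. It uses the standard equal-width-bin ECE and two hand-built toy prediction sets; the function names, bin count, and toy numbers are all assumptions chosen for illustration. Both sets have the same accuracy and the same ECE, yet only one of them assigns higher confidence to correct predictions. The `confidence_gap` helper is an illustrative quantity, not one of the metrics proposed in the paper.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard ECE with equal-width confidence bins (lo, hi]."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per prediction; clip so conf == 0 falls into the first bin.
    idx = np.clip(np.digitize(conf, edges, right=True) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def confidence_gap(conf, correct):
    """Mean confidence of correct minus incorrect predictions (illustrative only)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return conf[correct].mean() - conf[~correct].mean()

def make_predictions(groups):
    """Build toy data from (confidence, n_samples, n_correct) triples."""
    conf, correct = [], []
    for c, n, k in groups:
        conf += [c] * n
        correct += [1] * k + [0] * (n - k)
    return np.array(conf), np.array(correct)

# Toy case (a): accuracy is 0.75 in both confidence groups (indiscriminate).
indiscriminate = make_predictions([(0.65, 100, 75), (0.85, 100, 75)])
# Toy case (b): accuracy rises with confidence, but the model is overconfident.
regular = make_predictions([(0.75, 100, 65), (0.95, 100, 85)])

for name, (conf, correct) in [("indiscriminate", indiscriminate), ("regular", regular)]:
    print(name,
          "accuracy=%.3f" % correct.mean(),
          "ECE=%.3f" % expected_calibration_error(conf, correct),
          "confidence gap=%.3f" % confidence_gap(conf, correct))
```

In this toy setup the two cases are indistinguishable by accuracy and ECE, while the confidence gap is zero only for the indiscriminate case, which is the kind of difference the proposed metrics are meant to expose.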

Paper Structure (28 sections, 9 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Simulated reliability diagrams showing different miscalibration behaviors with the same ECE and accuracy. (a) An indiscriminate miscalibration behavior, which is also observed in zero-shot and few-shot prompting in our experiments; (b) a regular miscalibration behavior closer to that in the original calibration paper (Guo et al., 2017).
  • Figure 2: Reliability diagrams averaged across 5 datasets. For a perfectly calibrated model, the confidence matches the accuracy (y-axis), so the red gaps indicate the severity of miscalibration in each confidence bin. The first and second rows show the indiscriminate miscalibration behavior of large language models under the in-context learning (0-shot and 10-shot) settings, where accuracy is similar regardless of whether confidence is high or low. In certain cases, lower confidence even corresponds to higher accuracy. Under the comparative inference setting (e.g., third row), this issue is alleviated, and it is significantly improved with aggregated comparative inference (e.g., last row).
  • Figure 3: Quantifying indiscriminate miscalibration using the Indiscriminate Ratio (MacroCE) and the Discriminate KL (DKL) divergence. The metrics aim to capture the difference between the probability distributions of correct and incorrect predictions. Numbers are averaged across 5 datasets. Bars or shaded areas show the standard deviation across datasets. A smaller MacroCE (or a larger DKL) indicates a more discriminately calibrated model. Comparative inference helps alleviate indiscriminate miscalibration, and the aggregation method can further improve it.
  • Figure 4: Inference performance varies significantly across positions in the comparative inference setting; examples are from the TREC task (Li and Roth, 2002; Hovy et al., 2001). In the zero-shot setting, the performance decays at the second and third positions.
  • Figure 5: Performance (ECE and F1 score) improves as we aggregate more comparative inference results. Results are averaged across 5 datasets. Notably, the ECE decreases (lower is better) drastically with aggregation under our assumption (the aggregation equation).
  • ...and 6 more figures
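
The Figure 3 caption describes metrics that compare the confidence distributions of correct and incorrect predictions. The paper's exact definitions are not reproduced in this listing, so the following is only a rough sketch of that idea under stated assumptions: a histogram-based KL divergence between the two confidence distributions, with a hypothetical function name `confidence_kl`, an assumed bin count, an assumed smoothing constant `eps`, and an assumed direction (correct relative to incorrect). It should not be read as the paper's DKL.

```python
import numpy as np

def confidence_kl(conf, correct, n_bins=10, eps=1e-6):
    """KL(P_correct || P_incorrect) over binned confidences.

    Assumed formulation for illustration; eps smooths empty bins so the
    logarithm stays finite.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    p, _ = np.histogram(conf[correct], bins=edges)
    q, _ = np.histogram(conf[~correct], bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Toy data: confidences look the same for correct and incorrect predictions.
conf_same = np.array([0.65] * 75 + [0.85] * 75 + [0.65] * 25 + [0.85] * 25)
correct_same = np.array([True] * 150 + [False] * 50)

# Toy data: correct predictions are mostly high-confidence, incorrect mostly low.
conf_sep = np.array([0.65] * 30 + [0.85] * 120 + [0.65] * 40 + [0.85] * 10)
correct_sep = np.array([True] * 150 + [False] * 50)

kl_same = confidence_kl(conf_same, correct_same)
kl_sep = confidence_kl(conf_sep, correct_sep)
print("indiscriminate toy case: KL = %.4f" % kl_same)
print("discriminate toy case:   KL = %.4f" % kl_sep)
```

Under these assumptions, a value near zero means confidence carries almost no information about correctness, which matches the caption's reading that a larger DKL indicates a more discriminately calibrated model.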