Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering

Han Zhou; Xingchen Wan; Lev Proleev; Diana Mincu; Jilin Chen; Katherine Heller; Subhrajit Roy

TL;DR

This work tackles prompt brittleness and contextual bias in in-context learning by proposing Batch Calibration (BC), a zero-shot, inference-only method that marginalizes contextual bias across batched inputs. It unifies and analyzes existing calibration approaches (CC, DC, PC) from a decision-boundary perspective, identifies their failure modes, and uses these findings to motivate BC, with an extension to black-box few-shot learning (BCL). BC achieves state-of-the-art results on PaLM 2 and CLIP across 10+ NLP and vision-language tasks, while remaining inexpensive and robust to prompt design choices; BCL offers additional gains when labeled data are available. The approach generalizes across modalities and reduces the need for careful prompt engineering, enabling more reliable and scalable deployment of LLM-based systems. The work emphasizes the practical value of a simple, principled calibration layer that accounts for contextual priors without retraining, potentially impacting a wide range of real-world LLM applications.

Abstract

Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To address this problem, which results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
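The idea of controlling contextual bias from the batched input can be illustrated with a minimal sketch. It assumes that the contextual prior is estimated as the per-class mean of the model's log-probability scores over a batch of test inputs and then subtracted from every score; the abstract does not spell out this formula, so treat it as an assumption. The function names and the example scores are hypothetical and chosen only for illustration.

```python
from typing import List


def batch_calibrate(scores: List[List[float]]) -> List[List[float]]:
    """Subtract the per-class batch mean from each row of class scores.

    scores[i][j] is the (log-probability) score of class j for test input i.
    """
    n = len(scores)
    num_classes = len(scores[0])
    # Assumed estimate of the contextual prior: per-class mean over the batch.
    prior = [sum(row[j] for row in scores) / n for j in range(num_classes)]
    return [[row[j] - prior[j] for j in range(num_classes)] for row in scores]


def predict(scores: List[List[float]]) -> List[int]:
    """Return the argmax class index for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Hypothetical scores for 4 inputs and 2 classes; class 0 is favoured throughout.
raw_scores = [
    [-0.2, -1.7],
    [-0.3, -1.4],
    [-0.6, -0.8],
    [-0.5, -0.9],
]
uncalibrated = predict(raw_scores)
calibrated = predict(batch_calibrate(raw_scores))
```

In this toy batch the uncalibrated predictions all fall on class 0, while removing the batch mean moves the last two inputs to class 1. No labels and no extra model calls are involved, which is consistent with the zero-shot, inference-only description above.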

Paper Structure (40 sections, 4 equations, 11 figures, 11 tables)

Introduction
Related Work
Understanding and Improving ICL.
Bias in ICL and Calibrating LLMs.
A Systematic Analysis of Calibration
Bias in Prompting and In-Context Learning (ICL)
Overview of ICL Calibration Methods.
Contextual Calibration (CC) [zhao2021calibrate].
Domain-Context Calibration (DC) [fei-etal-2023-mitigating].
Prototypical Calibration (PC) [han2023prototypical].
Design Principles Behind Calibrations
What Constitutes a Better Decision Boundary for Calibrations?
Is Content-free Input a Good Estimator of the Contextual Prior?
Batch Calibration
Batch Calibration (BC).
...and 25 more sections

Figures (11)

Figure 1: Batch Calibration (BC) achieves the best performance on 1-shot ICL over calibration baselines on an average of 13 classification tasks on PaLM 2-S and PaLM 2-L [anil2023palm].
Figure 2: Visualization of the decision boundaries of uncalibrated ICL, and after applying existing calibration methods and the proposed BC (introduced in the Batch Calibration section), in representative binary classification tasks of SST-2 (top row) [socher-etal-2013-recursive] and QNLI (bottom row) [wang-etal-2018-glue] on 1-shot PaLM 2-S. We show success and failure cases for each baseline method (CC, DC, and PC), whereas BC is consistently effective. Refer to the appendix for more examples.
Figure 3: The distribution of ICL scores after applying CC and DC on QNLI. Due to an unfair content-free prior, the prediction by 1-shot PaLM 2 is biased towards entailment.
Figure 4: Illustration of Batch Calibration (BC). Batches of demonstrations with in-context examples and test samples are passed into the LLM. Due to implicit bias sources in the context, the score distribution from the LLM becomes highly biased. BC is a modular and adaptable layer option appended to the output of the LLM/VLM. BC generates calibrated scores according to the BC equations (eq:mean and eq:overprior). Highlighted symbols indicate the distribution means (visualized for illustration only).
Figure 5: BC benefits from labeled data: the performance of the adjustable BCL compared to the zero-shot BC as the calibration strength changes. The strength $\gamma$ at 0 and 1 represents the uncalibrated ICL and BC, respectively. We highlight the optimal strength learned from a labeled set and the best test strength. Refer to the appendix for more examples.
...and 6 more figures
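The role of the strength $\gamma$ described in the Figure 5 caption can be sketched as follows. The sketch assumes that the calibrated score is the raw score minus $\gamma$ times the per-class batch mean, so that $\gamma = 0$ leaves the scores uncalibrated and $\gamma = 1$ recovers BC, as the caption states. The grid search used to pick $\gamma$ from labeled data, the function names, and the example scores are all assumptions made for illustration, not the authors' procedure.

```python
from typing import List, Sequence


def calibrate_with_strength(scores: List[List[float]], gamma: float) -> List[List[float]]:
    """Subtract gamma times the per-class batch mean from every score."""
    n = len(scores)
    num_classes = len(scores[0])
    prior = [sum(row[j] for row in scores) / n for j in range(num_classes)]
    return [[row[j] - gamma * prior[j] for j in range(num_classes)] for row in scores]


def accuracy(scores: List[List[float]], labels: Sequence[int]) -> float:
    """Fraction of rows whose argmax class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def select_strength(scores: List[List[float]], labels: Sequence[int],
                    grid: Sequence[float]) -> float:
    """Pick the grid value with the highest labeled-set accuracy (assumed search)."""
    return max(grid, key=lambda g: accuracy(calibrate_with_strength(scores, g), labels))


# Hypothetical labeled scores: 4 inputs, 2 classes, true labels 0, 0, 1, 1.
labeled_scores = [
    [-0.2, -1.7],
    [-0.3, -1.4],
    [-0.6, -0.8],
    [-0.5, -1.0],
]
labels = [0, 0, 1, 1]

acc_uncalibrated = accuracy(calibrate_with_strength(labeled_scores, 0.0), labels)
best_gamma = select_strength(labeled_scores, labels, [0.0, 0.5, 1.0, 1.5])
```

On this toy data the uncalibrated accuracy is 0.5 and the search selects a strength of 1.0; on real data the selected value need not be 1, which is the point the caption makes about the learned optimal strength.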