Enhancing In-context Learning via Linear Probe Calibration

Momin Abbas, Yi Zhou, Parikshit Ram, Nathalie Baracaldo, Horst Samulowitz, Theodoros Salonidis, Tianyi Chen

TL;DR

The paper tackles unreliability in in-context learning (ICL) by showing that vanilla ICL predictions often have high Shannon entropy, signaling low confidence. It introduces Linear Probe Calibration (LinC), which applies a small affine transformation $\tilde{\mathbf{p}} = \mathrm{softmax}(\mathbf{A}\mathbf{p} + \mathbf{b})$ to the model's output probabilities $\mathbf{p}$ and learns $\mathbf{A}, \mathbf{b}$ from a tiny validation set of as few as five labeled samples. Empirically, LinC improves over vanilla ICL by up to $21\%$ on average and by up to $50\%$ on certain tasks, while also reducing calibration error and variance and boosting PEFT methods in low-resource regimes. The approach is lightweight, API-friendly, and robust to varying label proportions, prompt templates, and demonstration permutations.
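To make the transformation $\tilde{\mathbf{p}} = \mathrm{softmax}(\mathbf{A}\mathbf{p} + \mathbf{b})$ concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the cross-entropy loss, the identity/zero initialization, the learning rate, the epoch count, and the five toy validation samples are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate(p, A, b):
    """Return softmax(A p + b) for one probability vector p."""
    k = len(b)
    z = [sum(A[i][j] * p[j] for j in range(k)) + b[i] for i in range(k)]
    return softmax(z)

def cross_entropy(probs, labels, A, b):
    """Mean negative log-likelihood of the true labels after calibration."""
    total = sum(math.log(calibrate(p, A, b)[y]) for p, y in zip(probs, labels))
    return -total / len(probs)

def fit_linc(probs, labels, lr=0.5, epochs=300):
    """Fit A and b by full-batch gradient descent on cross-entropy.

    Assumptions (not taken from the source): the loss, the identity/zero
    initialization, and the optimizer settings.
    """
    k, n = len(probs[0]), len(probs)
    A = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        grad_A = [[0.0] * k for _ in range(k)]
        grad_b = [0.0] * k
        for p, y in zip(probs, labels):
            q = calibrate(p, A, b)
            for i in range(k):
                d = (q[i] - (1.0 if i == y else 0.0)) / n  # dLoss/dz_i
                grad_b[i] += d
                for j in range(k):
                    grad_A[i][j] += d * p[j]
        for i in range(k):
            b[i] -= lr * grad_b[i]
            for j in range(k):
                A[i][j] -= lr * grad_A[i][j]
    return A, b

# Toy validation set (hypothetical): five ICL output vectors, all biased
# toward class 0, with their true labels.
val_probs = [[0.90, 0.10], [0.85, 0.15], [0.80, 0.20], [0.60, 0.40], [0.55, 0.45]]
val_labels = [0, 0, 0, 1, 1]

identity_A = [[1.0, 0.0], [0.0, 1.0]]
zero_b = [0.0, 0.0]
loss_before = cross_entropy(val_probs, val_labels, identity_A, zero_b)
A, b = fit_linc(val_probs, val_labels)
loss_after = cross_entropy(val_probs, val_labels, A, b)
calibrated = calibrate([0.60, 0.40], A, b)
```

Because only the output probabilities are needed, a sketch like this works with models exposed through an API; the language model itself is never updated.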

Abstract

In-context learning (ICL) is a new paradigm for natural language processing that utilizes Generative Pre-trained Transformer (GPT)-like models. This approach uses prompts that include in-context demonstrations to generate the corresponding output for a new query input. However, applying ICL in practice does not scale with the number of samples, and it lacks robustness to different prompt templates and demonstration permutations. In this paper, we first show, using a new metric based on Shannon entropy, that GPT-like models using ICL produce unreliable predictions. To solve this problem, we then propose Linear Probe Calibration (LinC), a method that calibrates the model's output probabilities, resulting in reliable predictions and improved performance while requiring only minimal additional data (as few as five labeled samples). LinC significantly enhances the ICL test performance of GPT models on various benchmark datasets, with an average improvement of up to 21% and up to a 50% improvement in some cases, and significantly boosts the performance of PEFT methods, especially in the low-resource regime. Moreover, LinC achieves lower expected calibration error and is highly robust to varying label proportions, prompt templates, and demonstration permutations. Our code is available at https://github.com/mominabbass/LinC.
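As a small illustration of why entropy serves as an uncertainty signal, the sketch below computes plain Shannon entropy in bits (logarithm base two) for a predicted class distribution. The paper's exact entropy-based metric is not defined in this excerpt, so this is only the standard quantity it builds on; the function name and the example distributions are hypothetical.

```python
import math

def shannon_entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

# A near-uniform prediction carries high entropy (high uncertainty),
# while a peaked prediction carries low entropy.
uncertain = shannon_entropy_bits([0.5, 0.5])
confident = shannon_entropy_bits([0.99, 0.01])
```

For a binary task the value ranges from 0 bits (all mass on one label) to 1 bit (a uniform prediction).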

Paper Structure

This paper contains 17 sections, 9 equations, 15 figures, 14 tables, 1 algorithm.

Figures (15)

  • Figure 1: Example of ICL with an LLM $\theta^*$.
  • Figure 2: The efficacy of ICL is restricted by the GPT tokenizer's maximum sequence length limit. Black dashed lines mark the point beyond which additional shots cannot be utilized.
  • Figure 3: Shannon entropy histograms for vanilla ICL on GPT-2-XL (1.5B) vs. our method on SST-2 (higher entropy implies higher uncertainty); logarithms are base two. See the paper's section on Shannon entropy for a detailed explanation.
  • Figure 4: LinC outperforms ICL on all few-shot experiments, and substantially enhances PEFT, especially in the low resource regime, while maintaining almost identical data and compute requirements.
  • Figure 5: Comparison across six different templates.
  • ...and 10 more figures