Token-based Decision Criteria Are Suboptimal in In-context Learning

Hakaze Cho, Yoshihiro Sakai, Mariko Kato, Kenshiro Tanaka, Akira Ishii, Naoya Inoue

TL;DR

This work critiques token-probability-based decision criteria in in-context learning (ICL) and introduces Hidden Calibration, a lightweight method that builds per-label centroids from the LM's last hidden states and predicts the label of the nearest centroid. Across six models and ten datasets, Hidden Calibration consistently outperforms traditional token-based baselines, as well as a whole-vocabulary centroid method, achieving state-of-the-art results with limited computational overhead. The authors show that hidden-state centroids yield cleaner class separation and that demonstrations promote linear separability in the hidden space, offering new insight into the principles underlying ICL. The approach is also robust to prompt design and efficient in time, space, and data usage, suggesting practical benefits for real-world ICL deployments.

Abstract

In-Context Learning (ICL) typically derives its classification criteria from the output probabilities of manually selected label tokens. However, we argue that such token-based classification criteria lead to suboptimal decision boundaries, even when delicate calibrations through translation and constrained rotation are applied. To address this problem, we propose Hidden Calibration, which renounces token probabilities and uses a nearest-centroid classifier on the LM's last hidden states. In detail, we first estimate one centroid per label from a calibration set, then assign to each test sample the label of its nearest centroid. Our experiments on 6 models and 10 classification datasets indicate that Hidden Calibration consistently outperforms current token-based baselines by about 20% to 50%, achieving a strong state-of-the-art in ICL. Our further analysis demonstrates that Hidden Calibration finds better classification criteria with less inter-class overlap, and that LMs provide linearly separable intra-class clusters with the help of demonstrations, which supports Hidden Calibration and gives new insights into the principle of ICL. Our official code implementation can be found at https://github.com/hc495/Hidden_Calibration.
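The two steps described in the abstract (estimate one centroid per label from a calibration set, then assign each test sample the label of its nearest centroid) can be sketched as follows. This is a minimal illustration, not the official implementation: the function names `fit_centroids` and `predict_nearest_centroid` are hypothetical, the hidden states are synthetic stand-ins for real LM outputs, and Euclidean distance is an assumed choice of nearness measure.

```python
import numpy as np


def fit_centroids(hidden_states, labels):
    """Step 1: average the calibration hidden states of each label."""
    hidden_states = np.asarray(hidden_states, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack(
        [hidden_states[labels == c].mean(axis=0) for c in classes]
    )
    return classes, centroids


def predict_nearest_centroid(hidden_states, classes, centroids):
    """Step 2: assign each sample the label of its nearest centroid."""
    hidden_states = np.atleast_2d(np.asarray(hidden_states, dtype=float))
    # Assumption: Euclidean distance. The measure used in the paper
    # may differ (for example, a similarity score).
    diffs = hidden_states[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return classes[dists.argmin(axis=1)]


# Synthetic stand-ins for last hidden states of ICL prompts (assumption):
# two labels whose hidden states cluster around -2 and +2 in 8 dimensions.
rng = np.random.default_rng(0)
calib_h = np.concatenate([
    rng.normal(-2.0, 1.0, size=(50, 8)),
    rng.normal(2.0, 1.0, size=(50, 8)),
])
calib_y = np.array([0] * 50 + [1] * 50)

classes, centroids = fit_centroids(calib_h, calib_y)
test_h = np.array([[-2.0] * 8, [2.0] * 8])
pred = predict_nearest_centroid(test_h, classes, centroids)
```

In practice, `calib_h` and `test_h` would be replaced by the last hidden states the LM produces for calibration and test prompts; how those are extracted depends on the model library and is not shown here.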

Paper Structure (68 sections, 16 equations, 15 figures, 13 tables)

Figures (15)

  • Figure 1: Diagram of ICL. A. The ICL prompt is a concatenation of demonstrations and a query. LMs encode the prompt into the last hidden state $h$. B. Previous works use the un-embedding vectors of the label tokens ($E^U_+$, $E^U_-$) to decode $h$ into the prediction $\hat{y}$, then apply calibrations to adjust the predicted logits. C. Our work uses the calibration dataset to calculate centroids ($\bar{h}_+$, $\bar{h}_-$) to decode $h$.
  • Figure 2: Token probability-based decision boundaries (original and batch calibrated) are suboptimal compared to the centroid-based boundary. Points and contour lines are ICL's last hidden states and their kernel densities, mapped by Principal Component Analysis. The oblique coordinate axis is the direction of the un-embedding difference of the label tokens $\left(E_+^U-E_-^U\right)$, along which the kernel densities of the mapped data points are plotted. The rotating calibration by $A\neq \mathbf{1}$ (e.g., Contextual Calibration, Domain Calibration) has a limited feasible mapping direction.
  • Figure 3: The diagram of Hidden Calibration. Step 1: Calculate the hidden-state centroid of each label. Step 2: Take the label of the centroid nearest to the test sample as the prediction.
  • Figure 4: The classification performance (Macro F1, %) of 6 models averaged over 10 datasets. Hidden Calibration (Hidd.C) sets a new state-of-the-art for ICL, and demonstrations consistently improve its performance.
  • Figure 5: Sensitivity to (left) the prompt template, (middle) the demonstration label distribution, and (right) the demonstration order on Llama 2-6.9B and Rotten_Tomatoes. The legend is the same as in Figure 4 and is omitted.
  • ...and 10 more figures
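
The captions of Figures 1 and 2 describe token-based decoding as projecting the hidden state $h$ onto the label tokens' un-embedding vectors. For two labels, a short sketch shows why this fixes the decision direction: comparing the two logits is the same as taking the sign of the projection of $h$ onto $E_+^U-E_-^U$, and an additive shift on the logits only moves the threshold along that direction. The vectors below are random stand-ins, the un-embedding is assumed to have no bias term, and the additive shift `b` is one simple, assumed form of translation calibration rather than any specific method from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical un-embedding vectors of the two label tokens and
# hypothetical last hidden states (random stand-ins, not LM outputs).
E_pos = rng.normal(size=d)
E_neg = rng.normal(size=d)
H = rng.normal(size=(200, d))

# Token-based criterion: pick the label token with the larger logit.
logits = H @ np.stack([E_neg, E_pos]).T  # column 0: negative, 1: positive
token_pred = logits.argmax(axis=1)

# Equivalent view: sign of the projection onto E_pos - E_neg.
proj = H @ (E_pos - E_neg)
proj_pred = (proj > 0).astype(int)
same_boundary = bool(np.array_equal(token_pred, proj_pred))

# Assumed translation-style calibration: add a bias to each logit.
b = np.array([0.3, -0.1])
shifted_pred = (logits + b).argmax(axis=1)

# The shift only changes the threshold along the same direction.
threshold = b[0] - b[1]
shifted_proj_pred = (proj > threshold).astype(int)
same_shifted = bool(np.array_equal(shifted_pred, shifted_proj_pred))
```

Under these assumptions, the boundary normal stays tied to $E_+^U-E_-^U$ whatever the shift, whereas a centroid-based boundary is free to follow the direction along which the classes actually separate in the hidden space.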