Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning

Zhe Yang, Damai Dai, Peiyi Wang, Zhifang Sui

TL;DR

This paper investigates how to determine approximately optimal weights for demonstration examples and how to apply them during ICL, and designs a masked self-prediction (MSP) score that exhibits a strong correlation with the final ICL performance.

Abstract

Large Language Models (LLMs) have recently acquired In-Context Learning (ICL) ability as they scale up, allowing them to adapt quickly to downstream tasks with only a few demonstration examples prepended to the input sequence. Nonetheless, current ICL practice treats all demonstration examples equally, which leaves room for improvement because the quality of examples is usually uneven. In this paper, we investigate how to determine approximately optimal weights for demonstration examples and how to apply them during ICL. To assess the quality of weights in the absence of additional validation data, we design a masked self-prediction (MSP) score that exhibits a strong correlation with the final ICL performance. To expedite the weight-searching process, we discretize the continuous weight space and adopt beam search. With approximately optimal weights obtained, we further propose two strategies to apply them to demonstrations at different model positions. Experimental results on 8 text classification tasks show that our approach outperforms conventional ICL by a large margin. Our code is publicly available at https://github.com/Zhe-Young/WICL.

Paper Structure (25 sections, 10 equations, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: 4-shot ICL performance on SST2 with different sets of weights assigned to demonstration examples. Gray points represent the accuracy of different weight sets, and the red point denotes the accuracy of the non-weighting strategy. The performance varies significantly with different weights.
  • Figure 2: Regression line of MSP score and accuracy on the MR dataset. Each point denotes the performance of one weight vector. The Pearson correlation coefficient is 0.73, indicating a strong correlation between MSP score and accuracy.
  • Figure 3: An illustration of weighted in-context learning. Reweighting at the self-attention layer can be done by scaling the key matrix or by scaling the attention weights. The example weights can be obtained by beam search with the masked self-prediction score as an indicator, which shows a strong correlation with final performance.
  • Figure 4: Correlation of MSP and accuracy under different example weights. For each task, we randomly sample 50 legal weight vectors under the 8-shot setting and test accuracy on GPT-1.3B, showing scatter plots and regression lines.
  • Figure 5: An illustration of beam search for example weights. We take the 4-shot setting with beam size = 2 as an example, and the legal weight set for each example is {0.8, 1.0, 1.2}. In each step, we extend the beam states and preserve the 2 states with the highest MSP scores.
  • ...and 4 more figures
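
The beam search described in the Figure 5 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name beam_search_weights is hypothetical, weights that have not yet been decided are assumed to stay at 1.0, and toy_score is a made-up stand-in for the MSP score, whose definition is not given in this section. Only the 4-shot setting, the beam size of 2, and the legal weight set {0.8, 1.0, 1.2} come from the caption.

```python
from typing import Callable, List, Sequence, Tuple

WeightVector = Tuple[float, ...]


def beam_search_weights(
    num_examples: int,
    candidates: Sequence[float],
    score_fn: Callable[[WeightVector], float],
    beam_size: int,
) -> WeightVector:
    """Pick one weight per demonstration example with beam search.

    Assumption: undecided positions keep the neutral weight 1.0, so every
    beam state is a full-length weight vector that score_fn can evaluate.
    """
    beams: List[WeightVector] = [tuple([1.0] * num_examples)]
    for i in range(num_examples):
        # Extend every surviving state with each legal weight for example i.
        expanded = [
            state[:i] + (w,) + state[i + 1:]
            for state in beams
            for w in candidates
        ]
        # Keep only the beam_size states with the highest score.
        expanded.sort(key=score_fn, reverse=True)
        beams = expanded[:beam_size]
    return beams[0]


# Hypothetical stand-in for the MSP score: higher is better, and the
# maximum is reached at an arbitrary "ideal" weight vector chosen here
# purely for demonstration.
IDEAL = (1.2, 0.8, 1.0, 1.2)


def toy_score(weights: WeightVector) -> float:
    return -sum((w - t) ** 2 for w, t in zip(weights, IDEAL))


best = beam_search_weights(
    num_examples=4,
    candidates=(0.8, 1.0, 1.2),
    score_fn=toy_score,
    beam_size=2,
)
print(best)
```

With k examples, c candidate weights, and beam size b, this sketch scores at most k * b * c weight vectors instead of all c^k combinations, which is the motivation for discretizing the weight space and searching with a beam.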