Revisiting Demonstration Selection Strategies in In-Context Learning

Keqin Peng, Liang Ding, Yancheng Yuan, Xuebo Liu, Min Zhang, Yuanxin Ouyang, Dacheng Tao

TL;DR

In-context learning (ICL) performance is highly sensitive to the choice of demonstrations because data and model effects are entangled. The authors hypothesize that effective demonstrations reduce the uncertainty of the test input as perceived by the inference model, and propose TopK+ConE, a two-stage, data- and model-aware selection method that first narrows the candidates with TopK and then ranks them by a conditional-entropy-based criterion. Across 7 NLU tasks and 4 translation tasks, with models ranging from GPT2-XL to Llama2-13B as well as aligned chat models, the approach yields consistent improvements over strong baselines. Analyses suggest that it provides a unified explanation for prior ICL methods and remains robust to mix-domain demonstrations. The work offers practical guidance for demonstration selection in real-world LLM deployment, and the authors plan to release code.

Abstract

Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL), where a few examples are used to describe a task to the model. However, the performance of ICL varies significantly with the choice of demonstrations, and it is still unclear why this happens or which factors influence this choice. In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent. We further propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples, resulting in a simple and effective recipe for ICL. Empirically, our method yields consistent improvements in both language understanding and generation tasks with different model scales. Further analyses confirm that, besides the generality and stability under different circumstances, our method provides a unified explanation for the effectiveness of previous methods. Code will be released.
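
The two-stage idea summarized above (shortlist with TopK, then rerank with a conditional-entropy-style score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity for the TopK stage, and the `cond_nll` scorer are all assumptions made for illustration. In practice `cond_nll(demo, test_input)` would be the inference model's negative log-likelihood of the test input when the demonstration is given as context; here a toy word-overlap stand-in is used so the example runs without a model.

```python
import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k(test_vec: Sequence[float],
          cand_vecs: Sequence[Sequence[float]],
          k: int) -> List[int]:
    """Stage 1 (data-aware): indices of the k candidates most similar to the test input."""
    order = sorted(range(len(cand_vecs)),
                   key=lambda i: cosine(test_vec, cand_vecs[i]),
                   reverse=True)
    return order[:k]


def select_demonstrations(test_text: str,
                          test_vec: Sequence[float],
                          cand_texts: Sequence[str],
                          cand_vecs: Sequence[Sequence[float]],
                          cond_nll: Callable[[str, str], float],
                          k: int,
                          n_shots: int) -> List[int]:
    """Stage 2 (model-aware): rerank the shortlist by how uncertain the
    model remains about the test input given each demonstration
    (lower score is better), and keep the best n_shots."""
    shortlist = top_k(test_vec, cand_vecs, k)
    ranked = sorted(shortlist, key=lambda i: cond_nll(cand_texts[i], test_text))
    return ranked[:n_shots]


def toy_cond_nll(demo: str, test_input: str) -> float:
    """Toy stand-in for a model-based score (assumption, not a real LM):
    counts test-input words that do not appear in the demonstration."""
    demo_words = set(demo.split())
    return float(sum(1 for w in test_input.split() if w not in demo_words))


# Hypothetical candidate pool with made-up 2-d embeddings.
cand_texts = ["great movie , loved it", "terrible plot", "the weather is nice"]
cand_vecs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]

chosen = select_demonstrations(
    test_text="terrible movie plot",
    test_vec=[1.0, 0.0],
    cand_texts=cand_texts,
    cand_vecs=cand_vecs,
    cond_nll=toy_cond_nll,
    k=2,
    n_shots=1,
)
```

Keeping the scorer as a plain callable means the same selection loop can be reused with different inference models, which is the sense in which the second stage is model-dependent.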

Paper Structure

This paper contains 34 sections, 2 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: The 8-shot performance of data-dependent methods (BM25 and TopK) and our method on SST-2. The colour of each number indicates the relative performance of BM25 and TopK. We see that: 1) data-dependent methods cannot obtain optimal demonstrations across different models; 2) our data- and model-dependent method achieves consistent improvements across different models.
  • Figure 2: The 1-shot performance with different retrieval models on two classification datasets.
  • Figure 3: The performance of different inference models with three randomly sampled demonstrations on the SST-2 and SST-5 datasets. Model1, Model2, and Model3 represent GPT-J-6B, LLAMA2-7B, and LLAMA2-13B, respectively. The impact of a given demonstration varies with the inference model.
  • Figure 4: The average performance on 7 NLU tasks across different model scales. Our method consistently outperforms previous methods across model scales.
  • Figure 5: The average performance of different chat models on 7 NLU tasks.
  • ...and 2 more figures