Data Curation Alone Can Stabilize In-context Learning

Ting-Yun Chang, Robin Jia

TL;DR

This work addresses the instability of in-context learning (ICL) caused by randomly selecting training examples for the prompt. It introduces two data-valuation methods, CondAcc and Datamodels, that curate a small, stable subset of the training data yielding consistently high ICL performance across tasks and models, without altering the ICL mechanism itself. Sampling prompts from the curated subsets improves average accuracy over sampling from the full training set by 7.7% (CondAcc) and 6.3% (Datamodels); it also raises worst-case accuracy and generalizes to out-of-distribution data. The stable examples are not especially diverse in content or low in perplexity. The findings show that data-centric approaches to prompt-based learning are both important and feasible, and they offer practical guidance for constructing effective in-context prompts and for future data-aware prompting strategies.

Abstract

In-context learning (ICL) enables large language models (LLMs) to perform new tasks by prompting them with a sequence of training examples. However, it is known that ICL is very sensitive to the choice of training examples: randomly sampling examples from a training set leads to high variance in performance. In this paper, we show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm (e.g., prompt retrieval or calibration). We introduce two methods to choose training subsets -- both score training examples individually, then select the highest-scoring ones. CondAcc scores a training example by its average dev-set ICL accuracy when combined with random training examples, while Datamodels learns linear regressors that estimate how the presence of each training example influences LLM outputs. Across five tasks and two LLMs, sampling from stable subsets selected by CondAcc and Datamodels improves average accuracy over sampling from the entire training set by 7.7% and 6.3%, respectively. Surprisingly, the stable subset examples are not especially diverse in content or low in perplexity, in contrast with other work suggesting that diversity and perplexity are important when prompting LLMs.
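
The CondAcc scoring rule described above can be illustrated with a minimal sketch. It assumes the expensive step has already been done: a set of prompts was sampled at random from the training set and each prompt's dev-set ICL accuracy was measured. The function names, the representation of a prompt as a list of training-example indices, and the tie-breaking rule are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def condacc_scores(prompts, accuracies, num_train):
    """Score each training example by the mean dev-set accuracy of the
    sampled prompts that contain it (a sketch of the CondAcc idea).

    prompts    : list of prompts, each a list of training-example indices
                 (hypothetical representation)
    accuracies : dev-set ICL accuracy measured for each prompt
    num_train  : size of the training set
    """
    totals = np.zeros(num_train)
    counts = np.zeros(num_train)
    for prompt, acc in zip(prompts, accuracies):
        for idx in set(prompt):  # count an example once per prompt
            totals[idx] += acc
            counts[idx] += 1
    # Examples that never appeared in a sampled prompt get NaN.
    return np.divide(totals, counts,
                     out=np.full(num_train, np.nan), where=counts > 0)

def select_stable_subset(scores, size):
    """Return the indices of the `size` highest-scoring examples.
    Unscored (NaN) examples are ranked last; ties keep index order."""
    ranked = np.where(np.isnan(scores), -np.inf, scores)
    order = np.argsort(-ranked, kind="stable")
    return order[:size].tolist()
```

Prompts for ICL would then be sampled only from the indices returned by `select_stable_subset`, rather than from the whole training set.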

Paper Structure

This paper contains 42 sections, 6 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: 4-shot ICL performance of GPTJ on SST2. Each boxplot summarizes the results of 50 sampled prompts. Compared with baselines (blue), our methods (pink) greatly stabilize performance, with higher average accuracy (red diamonds) and lower variance.
  • Figure 2: An overview of our CondAcc method, which scores each training example by its average accuracy (red diamonds) when combined with other random training examples. Each boxplot summarizes the dev-set accuracies conditioned on a training example appearing in the sampled prompts.
  • Figure 3: Accuracy versus sequence length (left) and accuracy versus perplexity (right). Each dot corresponds to a training example. Examples in good subsets are not outliers with abnormally long lengths or high perplexities.
  • Figure 4: Different ways to visualize the diversity of examples. (a) and (b) compare the diversity of the good subset, bad subset, and randomly sampled subsets (boxplot). For both DIV-I and DIV-F, a higher number means a subset is more diverse. Overall, good subsets are no more diverse than random subsets. (c) visualizes the stable training examples selected by the CondAcc and Datamodels methods in the Datamodels embedding space, where each dot is a training example in AGNews. Both methods choose tightly clustered examples instead of diverse ones.
  • Figure 5: The ground-truth outcomes of an LLM versus predicted outcomes of datamodels on the test set of datamodels, which contains a set of newly sampled prompts with unseen combinations of training examples. The high correlations show that our datamodels can make accurate predictions.
  • ...and 3 more figures
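
The datamodel idea evaluated in Figure 5 (fit a linear regressor on sampled prompts, then predict outcomes for unseen combinations of training examples) can be sketched as follows. This is a simplified illustration under stated assumptions: the features are binary presence indicators only, the regressor is ordinary least squares, and the function names and the scalar `outcomes` target are hypothetical. The paper's actual features, targets, and fitting procedure may differ.

```python
import numpy as np

def presence_features(prompts, num_train):
    """One row per prompt; entry j is 1.0 if training example j is in it."""
    X = np.zeros((len(prompts), num_train))
    for row, prompt in enumerate(prompts):
        X[row, list(prompt)] = 1.0
    return X

def fit_datamodel(prompts, outcomes, num_train):
    """Least-squares linear regressor from example presence to an LLM
    outcome (assumed here to be one scalar per prompt).
    Returns num_train weights followed by a bias term."""
    X = presence_features(prompts, num_train)
    X = np.hstack([X, np.ones((len(prompts), 1))])  # bias column
    y = np.asarray(outcomes, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def predict(weights, prompts, num_train):
    """Predicted outcome for each prompt under the fitted datamodel."""
    X = presence_features(prompts, num_train)
    return X @ weights[:-1] + weights[-1]
```

When every prompt contains the same number of examples, the bias column is collinear with the presence columns, so the individual weights are determined only up to a shared constant shift. Predictions for prompts of that length and the relative ranking of examples are unaffected, which is what matters for selecting a subset.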