Monte Carlo Sampling for Analyzing In-Context Examples

Stephanie Schoch, Yangfeng Ji

TL;DR

This work tackles the brittleness of in-context learning by systematically studying how the number of demonstrations interacts with their order and with the specific exemplars chosen. It introduces a Monte Carlo sampling framework that incrementally adds exemplars while averaging over permutations, which gives a robust estimate of performance as a function of the number of examples $k$ while mitigating confounds from ordering and selection. The results show that previously reported performance plateaus do not consistently generalize across permutations, and that one-shot performance hinges on the particular exemplar used. Moreover, selecting exemplars with the Monte Carlo method yields performance that is robust to ordering and number of examples but lower than that of random sampling. These insights have practical implications for prompting strategies: they suggest a trade-off between robustness to prompt variability and peak accuracy, especially under restricted context windows.

Abstract

Prior works have shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between different presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of number of examples while explicitly accounting for effects from order and selected examples. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection to select examples that perform well across different orderings. We find a negative result, that while performance is robust to ordering and number of examples, there is an unexpected performance degradation compared to random sampling.
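The sampling idea described above (add exemplars one at a time along a random permutation, score every prefix, then average over permutations at each $k$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `monte_carlo_curve` and `toy_evaluate`, the numeric "quality" stand-ins for exemplars, and the scoring rule are all hypothetical, and a real study would replace `toy_evaluate` with a call that prompts a language model and measures accuracy on a held-out set.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def monte_carlo_curve(
    pool: Sequence[float],
    evaluate: Callable[[List[float]], float],
    num_permutations: int = 20,
    max_k: int = 8,
    seed: int = 0,
) -> Tuple[List[List[float]], List[float]]:
    """Score every prefix of several random permutations of `pool`.

    Returns (per-permutation curves, mean curve), both indexed by k,
    where k = 0 is the zero-shot prompt with no exemplars.
    """
    rng = random.Random(seed)
    max_k = min(max_k, len(pool))
    curves = []
    for _ in range(num_permutations):
        # One random ordering of the candidate exemplars.
        perm = rng.sample(list(pool), len(pool))
        # Incrementally grow the prompt: the first k exemplars of this ordering.
        curves.append([evaluate(perm[:k]) for k in range(max_k + 1)])
    # Average over permutations at each k to wash out ordering effects.
    mean_curve = [mean([c[k] for c in curves]) for k in range(max_k + 1)]
    return curves, mean_curve


def toy_evaluate(prompt: List[float]) -> float:
    """Hypothetical stand-in for model accuracy (an assumption, not real data).

    Each exemplar is a number in [-1, 1]; earlier positions weigh more,
    so the score depends on both which exemplars are used and their order.
    """
    score = 0.5 + 0.1 * sum(q / (i + 1) for i, q in enumerate(prompt))
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    pool = [0.9, -0.6, 0.3, 0.1, -0.2, 0.7, -0.8, 0.4]
    curves, mean_curve = monte_carlo_curve(pool, toy_evaluate)
    for k, acc in enumerate(mean_curve):
        print(f"k={k}: mean score over permutations = {acc:.3f}")
    # Spread of one-shot scores shows dependence on the chosen exemplar.
    one_shot = [c[1] for c in curves]
    print(f"one-shot range: {min(one_shot):.3f} to {max(one_shot):.3f}")
```

Keeping the per-permutation curves alongside the mean makes it possible to inspect the spread at each $k$, for example whether some one-shot prompts fall below the zero-shot score. Repeating the call with different `seed` values would give independent trials whose standard deviation can be reported.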

Paper Structure

This paper contains 21 sections, 1 equation, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 1: In-context performance for each dataset and model. Results show the average of 20 permutations at each step $k$ in the proposed Monte Carlo sampling method. Shaded regions show standard deviation of 5 trials.
  • Figure 2: Results on SST-2. Blue lines represent individual permutations and red line indicates average across all permutations within one trial.
  • Figure 3: One-shot MNLI performance across 5 trials. Each blue point represents the accuracy using the first exemplar in a permutation. Red points indicate zero-shot performance. Results show that zero-shot settings can outperform one-shot settings, dependent upon the selected example.
  • Figure 4: Performance with Llama2-13B on the QNLI dataset, using in-context subsets containing the highest-performing and lowest-performing data points on average from Section \ref{subsec:number}, along with a random baseline. Results represent 20 permutations, with standard deviation displayed as the shaded region for each line.
  • Figure 5: One-shot in-context learning performance on the Hellaswag dataset across 5 trials.
  • ...and 15 more figures