Monte Carlo Sampling for Analyzing In-Context Examples
Stephanie Schoch, Yangfeng Ji
TL;DR
This work tackles the brittleness of in-context learning by systematically studying how the number of demonstrations interacts with their order and with the specific exemplars chosen. It introduces a Monte Carlo sampling framework that incrementally adds exemplars while averaging over permutations, so that performance can be estimated as a function of the number of exemplars $k$ while controlling for confounds from ordering and selection. The results show that previously observed performance plateaus do not consistently generalize across permutations, and that one-shot performance hinges on the particular exemplar used. Moreover, selecting exemplars with the Monte Carlo method can improve robustness but may underperform random sampling in overall accuracy. For prompting strategies, these findings suggest a trade-off between robustness to prompt variability and peak accuracy, especially under restricted context windows.
Abstract
Prior work has shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between different presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of the number of examples while explicitly accounting for effects from order and selected examples. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and that whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection, choosing examples that perform well across different orderings. We report a negative result: while performance is robust to ordering and number of examples, it unexpectedly degrades compared to random sampling.
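The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation; it only assumes what the abstract states, namely that exemplars are added one at a time along randomly sampled permutations and that scores are averaged across permutations. The names `monte_carlo_curve`, `score_fn`, and `num_permutations` are hypothetical, and `score_fn` stands in for whatever evaluation (for example, validation accuracy of a prompted model) one would actually run. The per-exemplar average marginal gain is one plausible data-valuation-style quantity, included here as an assumption.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def monte_carlo_curve(
    pool: Sequence[str],
    score_fn: Callable[[List[str]], float],  # hypothetical: scores an ordered prompt
    num_permutations: int = 50,
    seed: int = 0,
) -> Tuple[List[float], Dict[int, float]]:
    """Estimate performance versus k, averaged over random permutations.

    Returns (curve, values): curve[k] is the mean score with the first k
    exemplars of a sampled permutation; values[i] is the mean marginal
    gain observed when exemplar i was appended.
    """
    rng = random.Random(seed)
    n = len(pool)
    totals = [0.0] * (n + 1)
    gains: Dict[int, List[float]] = {i: [] for i in range(n)}

    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)              # one sampled ordering of the pool
        prev = score_fn([])             # zero-shot score (k = 0)
        totals[0] += prev
        for k in range(1, n + 1):
            prompt = [pool[i] for i in order[:k]]  # incrementally grown prompt
            cur = score_fn(prompt)
            totals[k] += cur
            gains[order[k - 1]].append(cur - prev)  # gain from the newest exemplar
            prev = cur

    curve = [t / num_permutations for t in totals]
    values = {i: mean(g) for i, g in gains.items()}
    return curve, values


if __name__ == "__main__":
    # Toy scorer (an assumption for illustration): each "good" exemplar adds 0.1.
    toy_pool = ["good", "bad", "good"]
    toy_score = lambda prompt: 0.5 + 0.1 * sum(x == "good" for x in prompt)
    curve, values = monte_carlo_curve(toy_pool, toy_score, num_permutations=20)
    print(curve)
    print(values)
```

In this sketch, `curve` plays the role of the performance-versus-$k$ estimate, and ranking exemplars by `values` would be one simple way to select examples that help across orderings. With a real model, each call to `score_fn` is an evaluation run, so the number of sampled permutations would be chosen to fit the available budget.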
