Order Matters: Rethinking Prompt Construction in In-Context Learning

Warren Li, Yiqian Wang, Zihan Wang, Jingbo Shang

TL;DR

This paper investigates whether the order of in-context examples matters as much as which examples are chosen in few-shot prompting. Through controlled experiments across classification and generation tasks on open-source models (0.5B–27B) and GPT-5, the authors show that demonstration ordering introduces variance comparable to example selection. They demonstrate that near-optimal orderings can be identified from a modest development set, achieving most of the oracle test performance, but that such orderings do not reliably transfer across datasets. The work argues for treating prompt ordering as a first-class design choice in ICL and outlines directions for future research on larger models, multilingual settings, and new task formats.

Abstract

In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
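The comparison between a development-set pick and a test-label oracle can be illustrated with a minimal sketch. It assumes a small pool of demonstrations whose orderings can be enumerated exhaustively, and it uses hypothetical accuracy tables (`DEV_ACC`, `TEST_ACC`) in place of real model evaluations. The helper name `best_ordering` and all numbers are illustrative assumptions, not the authors' procedure or results.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Ordering = Tuple[str, ...]


def best_ordering(examples: Sequence[str],
                  score: Callable[[Ordering], float]) -> Ordering:
    """Return the permutation of `examples` with the highest score."""
    return max(permutations(examples), key=score)


# Hypothetical accuracies for every ordering of three demonstrations.
# In practice each value would come from running the model on a split.
DEV_ACC = {
    ("a", "b", "c"): 0.71, ("a", "c", "b"): 0.78, ("b", "a", "c"): 0.80,
    ("b", "c", "a"): 0.66, ("c", "a", "b"): 0.74, ("c", "b", "a"): 0.69,
}
TEST_ACC = {
    ("a", "b", "c"): 0.70, ("a", "c", "b"): 0.82, ("b", "a", "c"): 0.80,
    ("b", "c", "a"): 0.65, ("c", "a", "b"): 0.75, ("c", "b", "a"): 0.68,
}

examples = ("a", "b", "c")

# Ordering chosen using development accuracy only.
dev_pick = best_ordering(examples, DEV_ACC.__getitem__)
# Oracle ordering chosen by looking at test accuracy (an upper bound).
oracle_pick = best_ordering(examples, TEST_ACC.__getitem__)

dev_pick_test_acc = TEST_ACC[dev_pick]
oracle_test_acc = TEST_ACC[oracle_pick]
average_test_acc = sum(TEST_ACC.values()) / len(TEST_ACC)

print(f"dev-selected: {dev_pick} -> test acc {dev_pick_test_acc:.2f}")
print(f"oracle:       {oracle_pick} -> test acc {oracle_test_acc:.2f}")
print(f"average over orderings: {average_test_acc:.3f}")
```

In this toy table the development-set pick differs from the oracle ordering but still lands above the average over all orderings, which is the kind of gap the abstract describes. Exhaustive enumeration grows factorially with the number of demonstrations, so a real experiment would likely sample orderings instead.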

Paper Structure

This paper contains 16 sections, 3 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Measuring selection vs. ordering sensitivity via average grouped standard deviation.
  • Figure 2: Violin plots of the general distributions of ordering sensitivity and selection sensitivity across all model–dataset combinations.
  • Figure 3: Max test accuracy (orange), highest‐dev test accuracy (purple), and average test accuracy (green) w.r.t. different parameter values. The scores are aggregated over classification and generation tasks.
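The "average grouped standard deviation" named in the Figure 1 caption is not defined in this overview. One plausible reading, sketched below as an assumption, is to arrange accuracies in a grid indexed by example set and ordering, then average the standard deviation within each group. The function names, the use of the population standard deviation, and the toy numbers are all hypothetical and may differ from the paper's definition.

```python
from statistics import mean, pstdev
from typing import List

# acc[s][o] = accuracy with example set s presented in ordering o.
# Assumption: every example set is evaluated under the same number of orderings.
Grid = List[List[float]]


def ordering_sensitivity(acc: Grid) -> float:
    """Fix the example set, vary the ordering, then average the spreads."""
    return mean([pstdev(row) for row in acc])


def selection_sensitivity(acc: Grid) -> float:
    """Fix the ordering index, vary the example set, then average the spreads."""
    return mean([pstdev(col) for col in zip(*acc)])


# Toy grid: 2 example sets x 2 orderings (hypothetical values).
toy = [
    [0.70, 0.80],
    [0.60, 0.90],
]

print(f"ordering sensitivity:  {ordering_sensitivity(toy):.3f}")
print(f"selection sensitivity: {selection_sensitivity(toy):.3f}")
```

Under this reading, comparable values for the two quantities would correspond to the paper's claim that ordering introduces variance on the same scale as selection.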