Revisiting In-Context Learning with Long Context Language Models

Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar

TL;DR

This work re-evaluates In-Context Learning (ICL) with Long-Context Language Models (LCLMs) that can handle millions of tokens in a single prompt. It systematically compares traditional sample-selection strategies (relevance, diversity, curriculum, hard) against a simple random baseline across 18 datasets and multiple LCLMs, finding that sophisticated selection offers little to no advantage in many-shot scenarios. To address underutilization of the extended context when data are scarce, the authors propose a data augmentation approach that generates and filters synthetic demonstrations, significantly boosting performance while preserving efficiency through caching. Additional analyses reveal that excessively long contexts can harm performance and that robustness to noise is task-dependent. The main practical takeaway is a shift from sample selection toward maximizing context usage and data diversity in the extended-context ICL regime.

Abstract

In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we discover that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.
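The underutilization point can be made concrete with a minimal sketch, which is not the authors' implementation. It assumes a crude whitespace token count in place of a real tokenizer, and the prompt template and the names `count_tokens` and `build_many_shot_prompt` are hypothetical. Randomly ordered examples are packed into the prompt until a token budget is reached; when the pool is small, the budget is left largely unused.

```python
import random


def count_tokens(text):
    # Assumption: whitespace splitting as a rough stand-in for a real tokenizer.
    return len(text.split())


def build_many_shot_prompt(examples, query, budget, seed=0):
    """Pack randomly ordered (input, output) pairs into a prompt under a token budget."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)  # random selection: no relevance or diversity criterion

    query_block = f"Input: {query}\nOutput:"
    used = count_tokens(query_block)
    shots = []
    for x, y in pool:
        block = f"Input: {x}\nOutput: {y}"
        cost = count_tokens(block)
        if used + cost > budget:
            continue  # skip examples that would exceed the budget
        shots.append(block)
        used += cost

    prompt = "\n\n".join(shots + [query_block])
    return prompt, len(shots), used


# Toy data: ten short demonstrations (hypothetical).
examples = [(f"word{i}", f"label{i}") for i in range(10)]

# A tight budget forces a choice among examples.
_, n_small, used_small = build_many_shot_prompt(examples, "hello", budget=20)

# A large budget admits every example and still leaves most of the window empty.
_, n_large, used_large = build_many_shot_prompt(examples, "hello", budget=1000)
```

In the second call every available example fits and the budget is still far from exhausted, which is the situation in which the abstract proposes adding augmented examples.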

Paper Structure

This paper contains 39 sections, 1 equation, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Results of various sample selection approaches in 64-shot ICL with LCLMs. Approaches include Retrieval that selects examples similar to the target query, Diversity that aims for maximizing example variety, Curriculum that arranges examples in order from easiest to hardest, and Hard that uses only challenging examples, alongside Random that selects examples without any constraints. Results indicate that sample selection methods provide no significant improvement over the naive (random) approach and sometimes perform worse. Meanwhile, Augmentation refers to the approach that generates additional demonstrations and uses them along with original samples for ICL, particularly for low-resource tasks (such as translation, reasoning, and classification) that do not contain enough samples to utilize the full capacity of LCLMs, showing substantial performance gains.
  • Figure 2: Results of various sample selection approaches on ICL of 64 examples with LCLMs, where we average the performance over all models: Gemini Pro, Gemini Flash, and Llama 3.1, across four different tasks with 18 datasets. Each bar represents the averaged performance, with the upper and lower limits indicating standard deviation. See Figure \ref{fig:selection} for results on each model.
  • Figure 3: Results with varying the number of examples for ICL with Gemini Pro, where we average the results for each task.
  • Figure 4: Ratios of convex hull volume of in-context examples to the full dataset with varying numbers of ICL examples.
  • Figure 5: Visualization of embedding-space with original and synthetic examples.
  • ...and 4 more figures
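
The selection approaches named in the Figure 1 caption can be illustrated with a minimal sketch. This is not the paper's code: the toy 2-D embeddings, the per-example `difficulty` scores, the greedy farthest-point heuristic used for diversity, and all function names are assumptions made for illustration only.

```python
import math
import random


def cosine(u, v):
    # Cosine similarity between two vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_random(pool, k, seed=0):
    # Random: sample without any constraint.
    return random.Random(seed).sample(pool, min(k, len(pool)))


def select_retrieval(pool, query_vec, k):
    # Retrieval: the k examples most similar to the target query.
    ranked = sorted(pool, key=lambda ex: cosine(ex["vec"], query_vec), reverse=True)
    return ranked[:k]


def select_diverse(pool, k):
    # Diversity (assumed heuristic): greedily add the example whose
    # highest similarity to the already chosen set is lowest.
    if not pool or k <= 0:
        return []
    chosen = [pool[0]]
    rest = list(pool[1:])
    while rest and len(chosen) < k:
        nxt = min(rest, key=lambda ex: max(cosine(ex["vec"], c["vec"]) for c in chosen))
        chosen.append(nxt)
        rest.remove(nxt)
    return chosen


def order_curriculum(examples):
    # Curriculum: arrange examples from easiest to hardest.
    return sorted(examples, key=lambda ex: ex["difficulty"])


def select_hard(pool, k):
    # Hard: keep only the k most challenging examples.
    return sorted(pool, key=lambda ex: ex["difficulty"], reverse=True)[:k]


# Toy pool with hypothetical embeddings and difficulty scores.
pool = [
    {"id": "a", "vec": [1.0, 0.0], "difficulty": 0.1},
    {"id": "b", "vec": [0.9, 0.1], "difficulty": 0.7},
    {"id": "c", "vec": [0.0, 1.0], "difficulty": 0.4},
    {"id": "d", "vec": [-1.0, 0.0], "difficulty": 0.9},
]

retrieved = [ex["id"] for ex in select_retrieval(pool, [1.0, 0.05], k=2)]
diverse = [ex["id"] for ex in select_diverse(pool, k=3)]
curriculum = [ex["id"] for ex in order_curriculum(pool)]
hard = [ex["id"] for ex in select_hard(pool, k=2)]
```

Each function returns a list of examples that would then be formatted into the prompt. The captions report that, in the many-shot setting, the constrained strategies give no significant improvement over `select_random`.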