On the Relationship Between the Choice of Representation and In-Context Learning

Ioana Marinescu, Kyunghyun Cho, Eric Karl Oermann

TL;DR

This work investigates how the representation of in-context demonstrations and the quantity of demonstrations jointly shape in-context learning (ICL) in large language models. It introduces a label-representation optimization that enumerates label sets with varying semantic relevance and evaluates ICL on sentiment classification across 3-way and 5-way tasks using multiple model sizes. The key finding is that learning from demonstrations occurs across representations, but the baseline accuracy and attainable range are determined by the label representation, with learning efficiency additionally modulated by model size; the representation ranking remains stable across different numbers of demonstrations. The study argues for treating representation and demonstration quantity as separate levers in prompt design and provides practical guidance for selecting semantically meaningful class names to maximize ICL performance.

Abstract

In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.
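
The abstract measures learning efficiency by the slope of improvement over in-context demonstrations. The sketch below shows one way such a slope could be estimated, assuming an ordinary least-squares fit of accuracy against the number of demonstrations. The function name `learning_slope` and the accuracy values are hypothetical and are not taken from the paper, which may estimate the slope differently.

```python
def learning_slope(num_demos, accuracies):
    """Least-squares slope of accuracy against number of demonstrations.

    Assumption: a straight-line fit is an adequate summary of the curve.
    """
    n = len(num_demos)
    if n != len(accuracies) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(num_demos) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in num_demos)
    if sxx == 0:
        raise ValueError("num_demos must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(num_demos, accuracies))
    return sxy / sxx


# Hypothetical curves for two label sets (illustrative numbers only).
shots = [0, 10, 20, 30, 40]
relevant_labels = [0.60, 0.65, 0.70, 0.75, 0.80]    # higher baseline
irrelevant_labels = [0.35, 0.37, 0.39, 0.41, 0.43]  # lower baseline

slope_relevant = learning_slope(shots, relevant_labels)
slope_irrelevant = learning_slope(shots, irrelevant_labels)
```

In this toy example both slopes are positive, so both label sets learn, while the gap between the two curves persists at every number of demonstrations. That is the pattern the hypothesis describes: the representation sets the baseline and demonstrations add on top of it.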

Paper Structure

This paper contains 26 sections, 5 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Method overview. Step 1: We develop an optimization algorithm to enumerate a list of possible label sets for a sentiment classification task. Step 2: We label demonstration sentences according to the label sets found. Step 3: We use these demonstrations in ICL tasks and evaluate the performance obtained with each label set on the same set of test sentences.
  • Figure 2: Accuracy vs. number of demonstrations across model sizes for (a) 3-class and (b) 5-class settings. The curves were smoothed with a window size of 10, with error bars showing 95% CI over 10 runs. The legend shows the number of labeling examples $K$ used to fit the label set. Different $K$ values may result in the same label sets. For these sets, the color shown is that of the higher $K$.
  • Figure 3: Rank correlation coefficient between the zero-shot accuracy and the $N$-shot accuracy vs. the number of demonstrations $N$. $N \in \{\text{num classes}, \ldots, 40\}$ for the 1B and 8B models, $N \in \{10, 20, 30, 40\}$ for the 70B model. The CIs are computed over 1000 bootstrap samples from 10 runs per $N$-shot accuracy. The order of label sets in terms of quality stays consistent across $N$-shot experiments.
  • Figure 4: Evaluation of learning curves for label sets obtained with different numbers of labeling examples $K$. Rank correlation coefficient between $N$ and the $N$-shot accuracy vs. the zero-shot accuracy for each curve. $N \in \{\text{num classes}, \ldots, 40\}$ for the 1B and 8B models, $N \in \{10, 20, 30, 40\}$ for the 70B model. Higher correlation indicates that the accuracy for that curve is often strictly increasing with $N$ (steeper curve), while lower correlation indicates that the accuracy may plateau or decrease on some intervals (flatter curve). The CIs are computed over 1000 bootstrap samples from 10 runs per $N$-shot accuracy.
  • Figure 5: 3-way classification
  • ...and 1 more figure
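
The captions for Figures 3 and 4 rely on a rank correlation coefficient with bootstrap confidence intervals. The following is a minimal sketch of that kind of computation, assuming Spearman's rank correlation and a percentile bootstrap that resamples the runs of each label set with replacement. The captions do not name the coefficient or the resampling scheme, so both are assumptions, and all function names and numbers here are hypothetical.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def bootstrap_spearman_ci(zero_shot, n_shot_runs, num_samples=1000,
                          alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the correlation between zero-shot
    accuracy and mean N-shot accuracy across label sets.

    zero_shot:   one zero-shot accuracy per label set
    n_shot_runs: one list of per-run N-shot accuracies per label set
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(num_samples):
        means = []
        for runs in n_shot_runs:
            resampled = [rng.choice(runs) for _ in runs]
            means.append(sum(resampled) / len(resampled))
        rho = spearman(zero_shot, means)
        if rho == rho:  # skip NaN from degenerate resamples
            stats.append(rho)
    if not stats:
        raise ValueError("no valid bootstrap samples")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Hypothetical accuracies for three label sets, two runs each.
zero_shot_acc = [0.30, 0.50, 0.70]
ten_shot_runs = [[0.40, 0.41], [0.60, 0.61], [0.80, 0.81]]
ci_low, ci_high = bootstrap_spearman_ci(zero_shot_acc, ten_shot_runs)
```

A correlation near 1 means the ordering of label sets at zero shots is preserved at $N$ shots, which is the reading Figure 3 gives. For Figure 4, the same `spearman` function could be applied to the pair ($N$, $N$-shot accuracy) along a single curve.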