In-Context Learning (and Unlearning) of Length Biases

Stephanie Schoch, Yangfeng Ji

TL;DR

This work investigates whether large language models can pick up length-based statistical biases through in-context learning (ICL). The authors quantify how demonstration length, the number of demonstrations, and model size influence the emergence of length bias, using seven binary classification datasets and multiple model families. They show that length biases can be learned in-context and that longer demonstrations can increase the magnitude of the bias, even when the underlying fine-tuning did not exploit such cues. They also show that ICL can debias fine-tuned models, either by sampling demonstrations from the opposite-length tails or by random sampling, which mitigates the bias without parameter updates. The findings underscore the need for balanced demonstration sampling in prompts and offer practical guidance for designing robust ICL pipelines and debiasing strategies.

Abstract

Large language models have demonstrated strong capabilities to learn in-context, where exemplar input-output pairings are appended to the prompt for demonstration. However, existing work has demonstrated the ability of models to learn lexical and label biases in-context, which negatively impacts both performance and robustness of models. The impact of other statistical data biases remains under-explored, which this work aims to address. We specifically investigate the impact of length biases on in-context learning. We demonstrate that models do learn length biases in the context window for their predictions, and further empirically analyze the factors that modulate the level of bias exhibited by the model. In addition, we show that learning length information in-context can be used to counter the length bias that has been encoded in models (e.g., via fine-tuning). This reveals the power of in-context learning in debiasing model prediction behaviors without the need for costly parameter updates.

Paper Structure

This paper contains 33 sections, 78 figures, 7 tables.

Figures (78)

  • Figure 1: An illustration of our experimental setup and hypothesis. When sampling from the tails of the length distribution (left of image), we introduce a data length bias. If the model can learn this shortcut feature in-context, we expect per-class performance to be higher on test data whose length matches what was seen in the context window than on data of the opposite length (right of image).
  • Figure 2: An overview of in-context learning using $K$ input-output demonstrations concatenated to the test input $\{x_{test}, y_{test}\}$.
  • Figure 3: In-context learning validation performance across different models on the HANS dataset. For each graph, $y_1$ (Blue) was sampled from the short instances, and $y_2$ (Orange) was sampled from the long instances.
  • Figure 4: Finetuning validation performance across different models on the HANS dataset. For each graph, $y_1$ (Blue) was sampled from the short instances, and $y_2$ (Orange) was sampled from the long instances.
  • Figure 5: In-context learning validation performance across different models on the PAWS-X$_{\textsc{EN}}$ dataset. For each graph, $y_1$ (Blue) was sampled from the short instances, and $y_2$ (Orange) was sampled from the long instances.
  • ...and 73 more figures
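
The setup described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `sample_tail_demos` and `build_prompt`, the `tail_frac` cutoff, the whitespace word count used as the length measure, and the `Input:`/`Label:` prompt template are all assumptions made for the example. It draws demonstrations for one class from the short tail of that class's length distribution and for the other class from the long tail, then concatenates the $K$ demonstrations with the test input.

```python
import random


def sample_tail_demos(pool, k, short_label, long_label, tail_frac=0.25, seed=0):
    """Sample k demonstrations with an induced length bias.

    pool: list of (text, label) pairs.
    short_label examples are drawn from the shortest tail_frac of their class,
    long_label examples from the longest tail_frac of their class.
    tail_frac is a hypothetical cutoff chosen for this sketch.
    """
    rng = random.Random(seed)
    per_class = k // 2
    demos = []
    for label, take_short in ((short_label, True), (long_label, False)):
        # Sort this class by length in whitespace-separated words (an assumption).
        items = sorted(
            (ex for ex in pool if ex[1] == label),
            key=lambda ex: len(ex[0].split()),
        )
        # Make the tail at least large enough to draw per_class examples.
        n_tail = max(per_class, int(len(items) * tail_frac))
        tail = items[:n_tail] if take_short else items[-n_tail:]
        demos.extend(rng.sample(tail, per_class))
    rng.shuffle(demos)  # avoid grouping demonstrations by class in the prompt
    return demos


def build_prompt(demos, test_input):
    """Concatenate the demonstrations and the unlabeled test input (assumed template)."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)


# Toy pool: for each label, texts of 1 to 20 words.
pool = [(" ".join(["w"] * n), lab) for lab in ("y1", "y2") for n in range(1, 21)]
demos = sample_tail_demos(pool, k=4, short_label="y1", long_label="y2")
prompt = build_prompt(demos, "a held-out test sentence")
```

Swapping the `short_label` and `long_label` arguments draws each class from the opposite tail, which corresponds to the opposite-length sampling the TL;DR describes for countering a length bias already encoded in a fine-tuned model.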