Task Contamination: Language Models May Not Be Few-Shot Anymore

Changmao Li, Jeffrey Flanigan

TL;DR

The paper examines task contamination in large language models: exposure to task examples during pretraining can inflate zero-shot and few-shot evaluation results. It applies four complementary methods (chronological analysis, training data inspection, task example extraction, and membership inference) across 12 models and 16 tasks to measure contamination and how it varies with dataset release date. Models are more likely to beat the majority baseline on datasets released before their training data was collected than on datasets released afterward, which suggests contamination; on classification tasks with no possibility of contamination, they rarely beat simple majority baselines by a statistically significant margin. These results raise reliability concerns about current zero-shot and few-shot evaluations. The work underscores the need for transparent training-data disclosure and robust evaluation protocols so that LLM capabilities are assessed accurately rather than overstated.

Abstract

Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.
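
The comparison against a majority baseline can be illustrated with a minimal sketch. The function names and the toy labels are hypothetical, and the one-sided z-test on a proportion is a stand-in chosen for simplicity: the paper reports using a t-test, but its exact setup is not given in this summary.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def beats_majority(labels, predictions, confidence=0.99):
    """Return (accuracy, baseline, significant).

    `significant` is True when accuracy exceeds the majority baseline
    under a one-sided z-test at the given confidence level.
    (Assumption: z-test used here as a simple stand-in for the paper's t-test.)
    """
    n = len(labels)
    baseline = majority_baseline(labels)
    accuracy = sum(y == p for y, p in zip(labels, predictions)) / n
    # Standard error under the null hypothesis that accuracy equals the baseline.
    se = sqrt(baseline * (1 - baseline) / n)
    if se == 0:
        # Single-class dataset: the baseline is already 1.0 and cannot be beaten.
        return accuracy, baseline, False
    z = (accuracy - baseline) / se
    p_value = 1 - NormalDist().cdf(z)
    return accuracy, baseline, p_value < (1 - confidence)


# Hypothetical toy data: 60 examples of class "a", 40 of class "b".
labels = ["a"] * 60 + ["b"] * 40
always_a = ["a"] * 100  # a model that only predicts the majority class
acc, base, sig = beats_majority(labels, always_a)
print(f"accuracy={acc:.2f} baseline={base:.2f} significant={sig}")
```

A model that merely matches the majority class is never flagged as significant here, which is the situation the abstract describes for tasks without possible contamination.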

Paper Structure (28 sections, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Percentage of datasets with accuracy higher than the majority baseline, for datasets released before and after the LLM training data collection date, in both zero-shot (blue, left) and few-shot (green, right) settings. Results are across all models and all datasets. On datasets released after the LLM's training data collection date, the LLM is much less likely to improve upon the simple majority baseline. Stat. sig. (darker) is the percentage of datasets for which the performance above the majority baseline is significant at the 99% confidence level.
  • Figure 2: Percentage of datasets on which each LLM's accuracy exceeds the majority baseline (light color), as well as the percentage of tasks for which training data can be extracted with an instruction prompt (red, see also Table \ref{tab:extract}). Dark color is the percentage of datasets on which accuracy is significantly above the majority baseline (99% confidence level, t-test). Below each LLM, we list the training data collection year and, in parentheses, the total number of pre- or post-collection datasets (e.g., MoE has 7 datasets released after its training data collection date). For tasks without demonstrated possibility of task contamination (post-collection datasets (b) and (d), with no extracted task examples in red), models rarely show statistically significant improvements over majority baselines (see §\ref{sec:no_contamination} for details).
  • Figure 3: Average performance on datasets pre/post-2021. On the $x$-axis, LLMs are ordered chronologically by training data collection date (the collection year is listed below each LLM).
  • Figure 4: Task accuracy of a fine-tuned LLM baseline vs. task release year. $R^2 = .001$, which indicates that the task difficulty for our datasets does not increase over time.
  • Figure 5: The number of generated examples which exactly match the original set and the performance (accuracy).
  • ...and 4 more figures
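
The difficulty control behind Figure 4 rests on the $R^2$ of a linear fit of baseline accuracy against release year. The following is a minimal sketch of that statistic for a one-variable fit; the function name and the example values are hypothetical and are not the paper's data.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        # No variation in one variable: the fit explains nothing.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


# Hypothetical release years and fine-tuned baseline accuracies (illustrative only).
years = [2018, 2019, 2020, 2021, 2022, 2023]
accuracies = [0.81, 0.78, 0.83, 0.80, 0.82, 0.79]
print(f"R^2 = {r_squared(years, accuracies):.3f}")
```

An $R^2$ near zero, as reported in Figure 4, means release year explains almost none of the variation in baseline accuracy, which is why the authors conclude that newer datasets are not systematically harder.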