Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification

Kush Dubey

TL;DR

This study examines whether pretraining on unlabeled test-set text inflates few-shot text classification performance. It defines three accuracy estimators (acc_extra, acc_test, acc_base) and applies a hierarchical Bayesian analysis to 25 tasks across BERT, GPT-2, and Mistral 7B in few-shot and zero-shot settings. The analysis finds a robust pretraining boost when using independent unlabeled data, but no consistent evaluation bias from test-set pretraining ($E[\text{acc}_{\text{test}} - \text{acc}_{\text{extra}}] \approx 0$). It also shows substantial within-task and cross-task variance, which makes repeated subsampling necessary for stable, model- and task-agnostic conclusions. The findings suggest that releasing unlabeled test-set text does not inherently bias benchmark evaluations, but they underscore the importance of rigorous experimental design and transparency in few-shot NLP assessments, particularly for LLM-based evaluations.

Abstract

Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods which easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models. Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text. Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models -- BERT, GPT-2, and Mistral 7B -- do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds. Code and data are available at https://github.com/kddubey/pretrain-on-test/.
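The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation (the paper's own pseudocode is in Figure 2, and the real code is in the linked repository). The function names, the `pretrain`, `finetune`, and `score` callables, and the choice to draw `m` training examples and two unlabeled sets of equal size `n` are all assumptions made for the example.

```python
import random
from statistics import mean


def accuracy_differences(texts, labels, m, n, pretrain, finetune, score, rng):
    """One subsample: return (acc_test - acc_extra, acc_extra - acc_base).

    Hypothetical callables supplied by the caller:
      pretrain(unlabeled_texts) -> model
      finetune(model_or_None, train_texts, train_labels) -> model
      score(model, test_texts, test_labels) -> accuracy in [0, 1]
    Requires m + 2 * n <= len(texts).
    """
    # Sampling without replacement makes the three splits disjoint.
    idx = rng.sample(range(len(texts)), m + 2 * n)
    train, extra, test = idx[:m], idx[m:m + n], idx[m + n:]

    def pick(ids, seq):
        return [seq[i] for i in ids]

    x_train, y_train = pick(train, texts), pick(train, labels)
    x_test, y_test = pick(test, texts), pick(test, labels)

    def run(unlabeled):
        # unlabeled=None means no task-adaptive pretraining at all.
        model = pretrain(unlabeled) if unlabeled is not None else None
        model = finetune(model, x_train, y_train)
        return score(model, x_test, y_test)

    acc_base = run(None)                  # no pretraining
    acc_extra = run(pick(extra, texts))   # pretrain on independent unlabeled text
    acc_test = run(x_test)                # pretrain on unlabeled test-set text
    return acc_test - acc_extra, acc_extra - acc_base


def repeated_subsampling(texts, labels, m, n, pretrain, finetune, score,
                         repeats=20, seed=0):
    """Average the evaluation bias and pretraining boost over many subsamples."""
    rng = random.Random(seed)
    diffs = [
        accuracy_differences(texts, labels, m, n, pretrain, finetune, score, rng)
        for _ in range(repeats)
    ]
    biases, boosts = zip(*diffs)
    return mean(biases), mean(boosts)
```

All three accuracies are scored on the same test split, so within a subsample the only thing that varies is the unlabeled text used for pretraining. Averaging over repeated subsamples is what reduces the within-task variance the abstract warns about.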

Paper Structure (30 sections, 4 equations, 20 figures, 2 tables)

Introduction
Motivation
Related work
Experimental design
acc_extra
acc_test
acc_base
Repeated subsampling
Results
Analysis
Model
Overall effects
Task-level effects
Discussion
Overtraining
...and 15 more sections

Figures (20)

Figure 1: The experimental design (§ \ref{sec:exp}) for $n = 500$ as an example.
Figure 2: Pseudocode for the accuracy estimators defined in § \ref{sec:exp}.
Figure 3: Distributions of average accuracy differences (\ref{eq:marg}). The evaluation bias is akin to $\text{acc}_{\text{test}} - \text{acc}_{\text{extra}}$. The pretraining boost is akin to $\text{acc}_{\text{extra}} - \text{acc}_{\text{base}}$.
Figure 4: Distributions of average evaluation biases (\ref{eq:cond}) for the subset of tasks which reported an average evaluation bias of at least +3% accuracy in any configuration of the experiment.
Figure 5: Average accuracy differences (\ref{eq:marg}) after pretraining GPT-2 for 2 epochs instead of 1 (§ \ref{sec:overtraining}).
...and 15 more figures
