An Empirical Exploration in Quality Filtering of Text Data

Leo Gao

TL;DR

The work challenges the assumption that harsher data filtering from large internet corpora always yields better language model quality. By systematically varying a Pareto-based, shallow classifier filter and training a 1.3B GPT-Neo on 40 GB chunks, it reveals a non-monotonic relationship between filtering aggressiveness and downstream task performance across 13 tasks. The decline at high filtering levels is linked to misalignment between the proxy objective and true data quality (Goodhart's law) and is further associated with loss of domain-relevant content. The findings motivate developing more robust filtering objectives and conducting thorough analyses of data-curation choices to understand their practical impact on generalization.

Abstract

While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl always monotonically improves the quality of training data, we find that aggressive filtering can in fact lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work leads to detailed analysis of the effects of dataset filtering design choices on downstream model performance in future work.
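As a rough illustration of how filtering aggressiveness might be varied, the sketch below assumes the Pareto-noise rule popularized by GPT-3-style pipelines: a document is kept when a Pareto-distributed draw exceeds one minus its classifier score. The function names `keep_document` and `kept_fraction`, the uniform stand-in scores, and the specific `alpha` values are all hypothetical and are not taken from the paper.

```python
import random


def keep_document(score, alpha, rng):
    """Decide whether to keep one document given its quality score in [0, 1].

    Assumed rule: keep if noise > 1 - score, where noise follows a Lomax
    distribution (what numpy.random.pareto(alpha) samples from).
    """
    # random.paretovariate returns values >= 1; subtracting 1 gives Lomax noise >= 0.
    noise = rng.paretovariate(alpha) - 1.0
    return noise > 1.0 - score


def kept_fraction(scores, alpha, seed=0):
    """Fraction of documents that survive the filter for a given alpha."""
    rng = random.Random(seed)
    kept = sum(keep_document(s, alpha, rng) for s in scores)
    return kept / len(scores)


if __name__ == "__main__":
    # Stand-in classifier scores; a real pipeline would score each document.
    score_rng = random.Random(1)
    scores = [score_rng.random() for _ in range(10_000)]
    for alpha in (1.0, 3.0, 9.0):
        print(f"alpha={alpha}: kept {kept_fraction(scores, alpha):.3f}")
```

Under this assumed rule, a larger `alpha` makes the noise smaller, so fewer low-scoring documents slip through and the filter becomes more aggressive. Sweeping `alpha` while holding the post-filtering dataset size fixed is one way to obtain the range of filtering ratios the abstract refers to.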

Paper Structure

This paper contains 10 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Average accuracy across all 13 tasks for various filtering ratios using a shallow quality classifier. The amount of data post-filtering is held constant. Although filtering improves performance at first, discarding more data can actually reduce accuracy, due to misalignment between the filtering classifier's objective and text quality.
  • Figure 2: Plots of results for all downstream tasks explored in this paper. Higher is better on all metrics except LAMBADA perplexity (first plot in the third row), where lower is better.
  • Figure 3: Fraction of documents in filtered Common Crawl classified as BookCorpus2-like by a shallow classifier trained to distinguish OpenWebtext and BookCorpus2. Note that this plot has a different x-axis scale from the task evaluation plots.
  • Figure 4: Fraction of documents in filtered Common Crawl classified as PubmedAbstracts-like by a shallow classifier trained to distinguish OpenWebtext and PubmedAbstracts. Note that this plot has a different x-axis scale from the task evaluation plots.