ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

Ruihang Li, Yixuan Wei, Miaosen Zhang, Nenghai Yu, Han Hu, Houwen Peng

TL;DR

ScalingFilter is a reference-free data quality filter. It scores each document with a quality factor derived from the perplexity difference between two meta-models of different sizes, and this factor is linked to model scaling laws. Keeping the top-k documents ranked by this factor yields a 25B-token training set on which a 1.3B-parameter model achieves better zero-shot performance and greater semantic diversity than baselines. The work builds on a theoretical connection between data quality and scaling exponents, and introduces semantic diversity as a robust measure of dataset richness. Overall, ScalingFilter offers a principled, bias-reducing approach to data curation with practical gains in downstream task performance and data diversity, while acknowledging computational costs and limitations related to broader applicability and biases.

Abstract

High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as a reference, which can introduce bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of the reference dataset on the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. By training models with 1.3B parameters on the same data source processed by various quality filters, we find that ScalingFilter can improve the zero-shot performance of pre-trained models on downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and that ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.
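
The filtering idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the quality factor is the ratio of the smaller model's perplexity to the larger model's perplexity on the same document (the text here only says the score is based on the perplexity difference between two models), and the function names and toy perplexity values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one document: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def quality_factor(ppl_small: float, ppl_large: float) -> float:
    """ASSUMED form of the quality factor: small-model PPL / large-model PPL.

    A larger value means the bigger model gains more on this document,
    which the scaling-law argument associates with higher quality.
    """
    return ppl_small / ppl_large


def select_top_k(
    docs: Sequence[str],
    ppl_small: Sequence[float],
    ppl_large: Sequence[float],
    k: int,
) -> List[Tuple[float, str]]:
    """Score every document and keep the k with the highest quality factor."""
    scored = [
        (quality_factor(s, l), doc)
        for doc, s, l in zip(docs, ppl_small, ppl_large)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy, made-up perplexities from a hypothetical small and large model.
docs = ["doc_a", "doc_b", "doc_c"]
small_ppl = [40.0, 30.0, 50.0]
large_ppl = [20.0, 25.0, 45.0]

kept = select_top_k(docs, small_ppl, large_ppl, k=2)
print([doc for _, doc in kept])  # ['doc_a', 'doc_b']
```

Note that no reference dataset appears anywhere in the scoring: the only inputs are the two models' perplexities on the document itself, which is what makes the filter reference-free.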

Paper Structure (14 sections, 21 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: In ScalingFilter, we assess the quality of text documents by their scaling characteristics with language models of different sizes.
  • Figure 2: (a) A diagram illustrating the theoretical result that high-quality data accelerates the rate of loss decrease as model parameters increase, resulting in larger model scaling exponents $a$. (b) The average loss of GPT-2 models of different sizes on several datasets with recognized quality levels: Wikipedia, OpenWebText, and Books3 represent high-quality data, while Unfiltered CommonCrawl represents low-quality data. The results closely align with the theoretical analysis shown in (a): loss decreases faster with model size on the high-quality datasets.
  • Figure 3: Positive correlation between the number of datasets and semantic diversity, demonstrating semantic diversity as a reliable measure of data diversity.
  • Figure 4: Results on the relationship between semantic diversity and sample size. Semantic diversity stabilizes at a sample size of 10,000, with a standard deviation below 0.2. Therefore, we choose 10,000 as our sample size for calculating semantic diversity, as it represents the dataset's diversity adequately while ensuring computational efficiency.
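
The semantic diversity measurement referred to in Figures 3 and 4 can be sketched as follows. The exact formula is not given in this section, so the sketch assumes semantic diversity is the mean pairwise cosine distance between document embeddings, computed on a random sample of at most 10,000 documents as Figure 4 suggests. The function names are hypothetical, the embeddings are assumed to be non-zero vectors produced by some text embedding model, and the scale of the paper's metric may differ from this one.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_diversity(
    embeddings: List[Sequence[float]],
    sample_size: int = 10_000,
    seed: int = 0,
) -> float:
    """ASSUMED definition: mean pairwise cosine distance over a random sample.

    The loop is quadratic in the sample size; a real run on 10,000
    documents would use vectorized matrix operations instead.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings")
    if len(embeddings) > sample_size:
        embeddings = random.Random(seed).sample(embeddings, sample_size)

    total = 0.0
    n_pairs = 0
    for u, v in combinations(embeddings, 2):
        total += 1.0 - cosine_similarity(u, v)
        n_pairs += 1
    return total / n_pairs


# Toy 2-D "embeddings": two identical documents and one orthogonal one.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(semantic_diversity(toy))
```

Under this assumed definition, a dataset of near-duplicate documents scores close to 0, while mixing in documents from unrelated sources raises the score, which matches the qualitative trend described for Figure 3.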