Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo

TL;DR

This work proposes a simple yet powerful alternative to perplexity-based filtering: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density.

Abstract

As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from two drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy for PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
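
The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: documents are already tokenized into non-empty lists of tokens, priors are raw relative frequencies (not log-scaled), and outliers are the top and bottom tails of each statistic, following the figure captions. The function names `token_priors`, `doc_stats`, `tail_outliers`, and `prior_filter` are hypothetical.

```python
from collections import Counter
from statistics import fmean, pstdev


def token_priors(docs):
    """Estimate each token's prior as its relative frequency in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def doc_stats(doc, priors):
    """Return (mu_d, sigma_d): mean and population std of the doc's token priors.

    Assumes `doc` is non-empty (an assumption of this sketch).
    """
    values = [priors[tok] for tok in doc]
    return fmean(values), pstdev(values)


def tail_outliers(scores, tail_frac):
    """Indices of the lowest and highest `tail_frac` share of scores."""
    k = int(len(scores) * tail_frac)
    if k == 0:
        return set()
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return set(order[:k]) | set(order[-k:])


def prior_filter(docs, tail_frac=0.05):
    """Return indices of documents that are not tail outliers in mu_d or sigma_d."""
    priors = token_priors(docs)
    stats = [doc_stats(doc, priors) for doc in docs]
    mus = [m for m, _ in stats]
    sigmas = [s for _, s in stats]
    drop = tail_outliers(mus, tail_frac) | tail_outliers(sigmas, tail_frac)
    return [i for i in range(len(docs)) if i not in drop]
```

Everything here is counting and sorting, so no model inference is involved, which is the source of the speed advantage the abstract claims over PPL-based filtering. The tail fraction of 0.05 is only an illustrative default.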

Paper Structure

This paper contains 81 sections, 5 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: The line graph shows the logarithm of token priors (based on the GPT-2 tokenizer) computed from the Dolma dataset, sorted in descending order. The boxed regions highlight tokens from the top, middle, and bottom segments of the rank.
  • Figure 2: The line graph displays the values of $\mu_{\mathtt{d}}$ and $\sigma_{\mathtt{d}}$ computed from token priors in the Dolma dataset, sorted in descending order. Boxes are outlier samples from both distributions.
  • Figure 5: Extreme outlier samples selected based on three criteria, ensuring that each sample comes from a distinct criterion: PPL, $\mu_{\mathtt{d}}$, and $\sigma_{\mathtt{d}}$. ✓ indicates filtered out.
  • Figure 6: Overlap between outliers based on $\mu_{\mathtt{d}}$ and $\sigma_{\mathtt{d}}$ with those based on PPL, when filtering the top and bottom $\frac{e}{2}\%$ of samples (X-axis: $e$).
  • Figure 7: Proportion of Chinese data classified as outliers (Y-axis), after mixing Chinese and English data at a ratio of $a:100$ ($a$ as X-axis). Outliers are the top and bottom 5% of $\mu_{\mathtt{d}}$.
  • ...and 6 more figures