
FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

Erik Henriksson, Otto Tarkka, Filip Ginter

TL;DR

This paper addresses data quality for large language model (LLM) training with an LLM-based line-level filtering pipeline. GPT-4o mini labels a 20k-document FineWeb sample line by line, producing 382 descriptive low-quality labels that are grouped into 9 broad categories. A DeBERTa-v3 classifier trained on these labels scales the filtering to a 10B-token subset (FineWeb-10BT), and its calibrated quality scores are thresholded to produce cleaned datasets. GPT-2 models trained on the cleaned data reach higher HellaSwag accuracy and converge faster than models trained on the original data, even with up to 25% less data. The work illustrates the value of quality-focused preprocessing for data-efficient, greener LLM training, and releases FinerWeb-10BT and the accompanying code for reproducibility and further research.

Abstract

Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
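
To make the idea of line-level filtering concrete, the following is a minimal sketch and not the authors' implementation. It assumes each line of a document receives a quality score in [0, 1] and that lines scoring below a threshold are dropped. The names `filter_document` and `toy_score` are hypothetical, and `toy_score` is a keyword heuristic standing in for the trained classifier, which is not reproduced here.

```python
from typing import Callable


def filter_document(
    text: str,
    score_line: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Keep only the lines whose quality score is at or above the threshold."""
    kept = [line for line in text.split("\n") if score_line(line) >= threshold]
    return "\n".join(kept)


def toy_score(line: str) -> float:
    """Hypothetical stand-in for a learned line-quality classifier.

    Assumption: returns a low score for lines containing obvious
    boilerplate phrases and a high score otherwise.
    """
    boilerplate_markers = ("click here", "subscribe", "cookie")
    lowered = line.lower()
    return 0.1 if any(marker in lowered for marker in boilerplate_markers) else 0.95


# Made-up example document: two content lines around one boilerplate line.
doc = (
    "The glacier retreated two kilometres over a decade.\n"
    "Click here to subscribe!\n"
    "Researchers attribute this to warmer summers."
)

cleaned = filter_document(doc, toy_score, threshold=0.5)
print(cleaned)
```

Operating on lines rather than whole documents means a page with useful content and a few boilerplate lines loses only the boilerplate instead of being discarded entirely.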

Paper Structure (11 sections, 4 figures, 4 tables)


Figures (4)

  • Figure 1: UMAP plot of embeddings of the 50 most frequent LLM-generated label names, created using the Stella-en-400M-v5 model.
  • Figure 2: Confusion matrix of predictions from our line quality classifier on the test set.
  • Figure 3: Quality probabilities for a 1M-line sample from FineWeb-10BT, binned in 10% intervals (log scale). A total of 8% of lines fall below the 0.50 quality threshold, and 25% fall below the 0.90 threshold.
  • Figure 4: Average HellaSwag accuracy over 5 runs for models trained on three datasets: the original FineWeb-10BT and two cleaned versions with quality thresholds of 0.50 (8% data reduction) and 0.90 (25% data reduction). Dot markers indicate epoch ends for each dataset run. GPT-2 (124M) checkpoint accuracy is shown for reference.
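
The binning and threshold statistics described in the captions for Figures 3 and 4 can be illustrated with a short sketch. This is an assumption-laden example, not code from the paper: `bin_probabilities` and `fraction_below` are hypothetical helpers, and `toy_probs` is made-up data that does not reproduce the reported 8% and 25% figures.

```python
from collections import Counter
from typing import Dict, List


def bin_probabilities(probs: List[float], n_bins: int = 10) -> Dict[int, int]:
    """Count probabilities per equal-width bin; bin b covers [b/n_bins, (b+1)/n_bins)."""
    # A probability of exactly 1.0 is placed in the last bin.
    counts = Counter(min(int(p * n_bins), n_bins - 1) for p in probs)
    return {b: counts.get(b, 0) for b in range(n_bins)}


def fraction_below(probs: List[float], threshold: float) -> float:
    """Share of lines that a given quality threshold would remove."""
    return sum(p < threshold for p in probs) / len(probs)


# Made-up quality probabilities for ten lines (illustration only).
toy_probs = [0.05, 0.45, 0.55, 0.85, 0.92, 0.95, 0.97, 0.99, 0.99, 1.0]

bins = bin_probabilities(toy_probs)
print(bins)
print(fraction_below(toy_probs, 0.50))
print(fraction_below(toy_probs, 0.90))
```

Raising the threshold from 0.50 to 0.90 removes more lines, which is the trade-off between data volume and data quality that the two cleaned datasets explore.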