Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets

Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

TL;DR

Blu-WERP presents a scalable preprocessing pipeline that maximizes training data quality from web-scale sources by integrating semantic-aware filtering, multi-level deduplication via Bloom filters, and benchmark-targeted classifier selection. The approach is validated with controlled ablations and a benchmark-driven evaluation framework, demonstrating superior aggregate performance and favorable scaling trajectories compared with state-of-the-art baselines. Key contributions include a BETR-based FastText classifier, a Bloom Filter-based deduplication strategy, and a scaling-law data selection protocol that predicts performance at larger model sizes without full-scale training. The results highlight the importance of data-centric AI design for improving LLM performance and efficiency, offering a practical, reproducible framework for researchers and practitioners.

Abstract

High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl (CC) WARC files for LLM training. We demonstrate that Blu-WERP significantly outperforms established baselines, including DCLM, across multiple model scales and evaluation benchmarks. Our pipeline processes CC WARC dumps, implementing advanced filtering and quality assessment mechanisms. We conducted comprehensive evaluations using models with 150M, 400M, 530M, 750M, and 1B parameters, testing against nine standard benchmarks categorized as World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Results show that Blu-WERP consistently achieves superior performance across all model scales. At the 1B parameter scale, Blu-WERP demonstrates relative aggregate improvements of 4.0% over DCLM and 9.5% over FineWeb, while also achieving a quality-per-token efficiency gain. Categorical analysis reveals a 2.4% improvement in World Knowledge & Reasoning, a 6.2% improvement in Language Understanding, and a 4.2% improvement in Commonsense Reasoning. These results establish Blu-WERP as a state-of-the-art preprocessing pipeline that substantially improves LLM training data quality and downstream model performance at reduced computational cost. Our findings contribute to the growing body of research on data-centric AI, demonstrating that preprocessing pipeline design significantly impacts LLM capabilities. The Blu-WERP pipeline represents a practical advancement in data quality optimization, offering researchers and practitioners an effective solution for improving LLM training efficiency and model performance.
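
The relative improvement over DCLM quoted above can be sanity-checked against the aggregate accuracies given in the Figure 1 caption (53.88% for Blu-WERP, 51.81% for DCLM). The short sketch below is illustrative only: it assumes those Figure 1 scores correspond to the 1B setting, and the helper name `relative_improvement` is hypothetical rather than taken from the paper. The FineWeb comparison is not reproduced because its aggregate score is not listed here.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return (new - baseline) / baseline * 100.0


# Aggregate accuracies (in percent) as reported in the Figure 1 caption.
blu_werp = 53.88
dclm = 51.81

absolute_gap = blu_werp - dclm                      # percentage points
relative_gain = relative_improvement(blu_werp, dclm)  # percent of the baseline

print(f"absolute gap: {absolute_gap:.2f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gap is about 2.07 percentage points, which is roughly 4.0% of the DCLM baseline, consistent with the relative figure reported in the abstract.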

Paper Structure

This paper contains 40 sections, 3 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Aggregate benchmark performance comparison across five datasets. Blu-WERP achieves 53.88% aggregate accuracy, outperforming DCLM 2 (51.81%) and other baselines.
  • Figure 2: Evaluation results across nine benchmarks from the standardized evaluation suite. Our dataset outperforms all other corpora on the majority of tasks, with competitive results comparable to DCLM. While performance on MMLU and SocialIQA slightly trails DCLM, our dataset achieves parity in benchmarks assessing world knowledge (MMLU, ARC Easy, ARC Challenge) and demonstrates superior results in language understanding (HellaSwag, SocialIQA) and common-sense reasoning (CSQA, PIQA).
  • Figure 3: Aggregate benchmark performance across four parser configurations. Justext achieves the highest aggregate score (0.4474) with 49.96% retention, followed by Trafilatura (0.4429, 43.25% retention), Resiliparse (0.4402, 19.44% retention), and the WET baseline (0.4057).
  • Figure 4: Deduplication Aggregate Score. The results indicate that MinHash-based deduplication alone does not yield optimal performance; its effectiveness increases significantly when combined with a Bloom filter component. Incorporating Bloom filtering enhances the removal of residual near-duplicates, thereby improving overall model performance. Notably, the Bloom Filter (Old Both) configuration achieves performance comparable to the hybrid deduplication setup, demonstrating its efficiency as a balanced and reliable deduplication strategy.
  • Figure 5: Classifier ablation comparison across four approaches. The BETR-based FastText classifier achieves the highest aggregate accuracy (0.538), outperforming the DeBERTa (0.4948), DCLM-bin FastText (0.5137), and LLaMA-Score+BERT (0.5128) classifiers.
  • ...and 6 more figures
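
To make the Bloom filter deduplication discussed in the Figure 4 caption more concrete, the following is a minimal sketch of the general technique, not the authors' implementation. Everything in it is an assumption made for illustration: the `BloomFilter` class, the `normalize` and `dedup_corpus` helpers, the filter size and hash count, and the choice of exact matching on whitespace- and case-normalized text at two levels (whole document, then paragraph). The MinHash component of the hybrid setup is not shown.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step avoids a zero stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def dedup_corpus(docs: Iterable[str]) -> List[str]:
    """Drop repeated documents, then repeated paragraphs (assumed two levels)."""
    seen_docs = BloomFilter()
    seen_paragraphs = BloomFilter()
    kept: List[str] = []
    for doc in docs:
        doc_key = normalize(doc)
        if doc_key in seen_docs:
            continue  # document-level duplicate (or a rare false positive)
        seen_docs.add(doc_key)
        fresh = []
        for paragraph in doc.split("\n"):
            key = normalize(paragraph)
            if not key or key in seen_paragraphs:
                continue  # empty line or paragraph already seen elsewhere
            seen_paragraphs.add(key)
            fresh.append(paragraph)
        if fresh:
            kept.append("\n".join(fresh))
    return kept


sample = [
    "Hello world\nShared footer",
    "hello   world\nshared footer",   # same document after normalization
    "Another page\nShared footer",    # new document, repeated footer
]
cleaned = dedup_corpus(sample)
print(cleaned)
```

A Bloom filter can report an unseen item as already present (a false positive) but never the reverse, so this style of deduplication may discard a small fraction of unique text in exchange for memory that stays fixed regardless of corpus size. The filter dimensions used here are placeholders and would need to be sized for the expected number of items.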