Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets

Gowtham; Sai Rupesh; Sanjay Kumar; Saravanan; Venkata Chaithanya

TL;DR

Blu-WERP presents a scalable preprocessing pipeline that maximizes training data quality from web-scale sources by integrating semantic-aware filtering, multi-level deduplication via Bloom filters, and benchmark-targeted classifier selection. The approach is validated with controlled ablations and a benchmark-driven evaluation framework, demonstrating superior aggregate performance and favorable scaling trajectories compared with state-of-the-art baselines. Key contributions include a BETR-based FastText classifier, a Bloom Filter-based deduplication strategy, and a scaling-law data selection protocol that predicts performance at larger model sizes without full-scale training. The results highlight the importance of data-centric AI design for improving LLM performance and efficiency, offering a practical, reproducible framework for researchers and practitioners.
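To make the Bloom-filter deduplication idea concrete, here is a minimal sketch of exact-duplicate removal at the document level. It is not the authors' implementation: the class `BloomFilter`, the function `deduplicate`, the whitespace-and-case normalization, the SHA-256 double-hashing scheme, and the 1% false-positive target are all assumptions chosen for illustration, and the multi-level strategy the summary describes would presumably apply a similar check at more than one granularity.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter (illustrative; not the Blu-WERP implementation)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(documents, bloom: BloomFilter):
    """Keep the first occurrence of each document after simple normalization."""
    kept = []
    for doc in documents:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        key = " ".join(doc.lower().split())
        if key in bloom:
            continue  # probably seen before (Bloom filters allow rare false positives)
        bloom.add(key)
        kept.append(doc)
    return kept


docs = ["Hello   world", "hello world", "Another page"]
unique_docs = deduplicate(docs, BloomFilter(expected_items=1000))
```

The appeal of this design at web scale is that memory use is fixed up front by the expected item count and the tolerated false-positive rate, at the cost of occasionally discarding a document that was never actually seen.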

Abstract

High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl WARC files for LLM training. The pipeline processes Common Crawl WARC dumps and applies advanced filtering and quality assessment mechanisms. We show that Blu-WERP significantly outperforms established baselines, including DCLM, across multiple model scales and evaluation benchmarks. We evaluate models with 150M, 400M, 530M, 750M, and 1B parameters on nine standard benchmarks grouped into three categories: World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Blu-WERP consistently achieves superior performance across all model scales. At the 1B parameter scale, Blu-WERP shows relative aggregate improvements of 4.0% over DCLM and 9.5% over FineWeb, while also achieving a quality-per-token efficiency gain. Category-level analysis shows improvements of 2.4% in World Knowledge & Reasoning, 6.2% in Language Understanding, and 4.2% in Commonsense Reasoning. These results establish Blu-WERP as a state-of-the-art preprocessing pipeline that substantially improves LLM training data quality and downstream model performance at reduced computational cost. Our findings contribute to the growing body of research on data-centric AI, demonstrating that preprocessing pipeline design significantly affects LLM capabilities. Blu-WERP offers researchers and practitioners a practical means of improving LLM training efficiency and model performance.
