
FLUX: Data Worth Training On

Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

Abstract

Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). A FLUX-trained model matches the aggregate score of a model trained on DCLM data while using only 39B tokens, a 34.4% reduction in training compute. At the data level, FLUX extracts 50B usable tokens from a single dump (CC-MAIN-2025-51), compared to 40B from DCLM (25% more tokens retained). FLUX-Base yields 192B tokens, exceeding FineWeb's 170B while still maintaining superior quality. Overall, FLUX establishes a new state of the art in web-scale data preprocessing by demonstrating that high retention, strong quality control, and computational efficiency can be achieved simultaneously, redefining the limits of scalable dataset construction for modern language models.

Paper Structure (19 sections, 1 equation, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Compute Savings at Equal Performance (3B scale). DCLM requires $1.227\times10^{21}$ FLOPs to reach 50.48% aggregate score, whereas FLUX reaches the same performance with $8.044\times10^{20}$ FLOPs (about 34.4% lower compute).
  • Figure 2: Aggregate performance after the parsing stage. Blu-WERP exhibits a measurable advantage over DCLM immediately following text extraction.
  • Figure 3: Aggregate performance after heuristic filtering. The aggressive filtering stage substantially attenuates the parsing-stage advantage observed for Blu-WERP.
  • Figure 4: Parser-Level Downstream Performance Comparison (530M scale; 10.6B tokens). This figure reports aggregated downstream benchmark performance for models trained on corpora produced by each parser configuration (Apex, Resiliparse, and Trafilatura) under the same training setup. It highlights the quality-retention trade-off across parsers and motivates the final parser choice used in the FLUX pipeline.
  • Figure 5: Heuristic filtering summary (530M scale; 10.6B tokens). Overview figure for the FLUX filtering design and its retention-quality behavior.
  • ...and 4 more figures
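
The percentages quoted in the abstract and in the Figure 1 caption can be checked directly from the reported numbers. The short Python sketch below is only an arithmetic check, not part of the FLUX pipeline; the helper names `relative_reduction` and `relative_gain` are illustrative and do not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return 1.0 - improved / baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is higher than `baseline`."""
    return improved / baseline - 1.0


# FLOPs needed to reach a 50.48% aggregate score, as given in the Figure 1 caption.
dclm_flops = 1.227e21
flux_flops = 8.044e20

# Usable tokens extracted from the single dump CC-MAIN-2025-51, as given in the abstract.
dclm_tokens = 40e9
flux_tokens = 50e9

compute_saving = relative_reduction(dclm_flops, flux_flops)
retention_gain = relative_gain(dclm_tokens, flux_tokens)

print(f"Compute saving at equal performance: {compute_saving:.1%}")
print(f"Additional tokens retained vs. DCLM: {retention_gain:.1%}")
```

Under these inputs the compute saving rounds to the 34.4% stated in the caption, and the retention gain is the 25% stated in the abstract.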