Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

TL;DR

The paper introduces Nemotron-CC-Math, a 133B-token-scale high-quality math corpus built from Common Crawl via a robust, domain-agnostic extraction pipeline. It combines layout-aware Lynx rendering with an LLM-based cleaning stage to preserve equations and code, followed by quality filtering, deduplication, and rigorous decontamination. The authors show that pretraining on Nemotron-CC-Math-4+ and Nemotron-CC-Math-3+ yields substantial improvements across math, code, and general knowledge benchmarks, outperforming prior open math datasets and scaling with data. They also demonstrate an efficient, smaller-model approach for boilerplate removal and provide thorough analyses of data composition, domain distribution, and topic coverage. The work offers a practical, open-source framework for generating high-fidelity, math-rich pretraining data applicable to other technical domains.

Abstract

Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. The resulting corpus comes in two variants: Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields gains of +4.8 to +12.6 on MATH and +4.6 to +14.3 on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
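The filtering and deduplication stages mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the `Doc` type, the integer `quality` field, the `filter_and_dedup` helper, and the reading of "3+" and "4+" as minimum-score thresholds are all assumptions made for illustration, and exact hashing stands in for whatever deduplication method the paper actually uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    text: str
    quality: int  # hypothetical 1-5 score from some math-quality classifier


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial variants hash identically.
    return " ".join(text.split()).lower()


def filter_and_dedup(docs, min_quality):
    """Keep docs scoring >= min_quality, dropping exact duplicates after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        if doc.quality < min_quality:
            continue
        digest = hashlib.sha256(normalize(doc.text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized text already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    Doc("a", "Let $x^2 = 4$.  Then $x = \\pm 2$.", 5),
    Doc("b", "let $x^2 = 4$. then $x = \\pm 2$.", 4),  # duplicate of "a" after normalization
    Doc("c", "Buy cheap calculators here!", 1),
    Doc("d", "The derivative of $\\sin x$ is $\\cos x$.", 3),
]

subset_3_plus = filter_and_dedup(docs, min_quality=3)  # assumed meaning of "3+"
subset_4_plus = filter_and_dedup(docs, min_quality=4)  # assumed meaning of "4+"
print([d.url for d in subset_3_plus])  # ['a', 'd']
print([d.url for d in subset_4_plus])  # ['a']
```

Under this reading, the higher-threshold subset is contained in the lower-threshold one, which is at least consistent with the reported sizes (52B tokens for 4+ versus 133B for 3+).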

Paper Structure

This paper contains 28 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the Nemotron-CC-Math construction pipeline. Starting from Common Crawl snapshots, we extract math-related URLs using curated datasets (e.g., MegaMath, FineMath). After fetching 229.54M webpages, we render pages through Lynx and apply LLM-based cleaning, quality filtering, and deduplication (see the math extraction pipeline section). We visualize the topic distribution of our data (right).
  • Figure 2: Mathematical expressions on HTML pages appear in diverse formats—LaTeX within custom delimiters, <pre> blocks, image tags, and MathML. These variations challenge standard text extraction pipelines, which often fail to recover the underlying LaTeX equations correctly. To address this, we use an LLM to standardize all mathematical representations into a unified LaTeX format.
  • Figure 3: Data mixtures for each phase of the pretraining experiments presented in the merged pretraining results table.
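
The format diversity described in the Figure 2 caption can be made concrete with a toy example. The paper uses an LLM to standardize math into LaTeX; the sketch below swaps in a simple rule-based rewrite that handles only three delimiter styles, purely to show what "unifying" delimiters means. The function name `unify_math_delimiters` and the chosen patterns are hypothetical, and a regex approach like this would not recover math from images or MathML.

```python
import re

# Hypothetical delimiter patterns; real pages use many more variants.
_DISPLAY_PATTERNS = [
    re.compile(r"\\\[(.+?)\\\]", re.DOTALL),         # \[ ... \]
]
_INLINE_PATTERNS = [
    re.compile(r"\\\((.+?)\\\)", re.DOTALL),         # \( ... \)
    re.compile(r"\[tex\](.+?)\[/tex\]", re.DOTALL),  # [tex] ... [/tex]
]


def unify_math_delimiters(text: str) -> str:
    """Rewrite a few known math delimiters to $...$ (inline) or $$...$$ (display)."""
    # Callables are used as replacements so backslashes in the math are kept verbatim.
    for pat in _DISPLAY_PATTERNS:
        text = pat.sub(lambda m: "$$" + m.group(1).strip() + "$$", text)
    for pat in _INLINE_PATTERNS:
        text = pat.sub(lambda m: "$" + m.group(1).strip() + "$", text)
    return text


sample = (
    r"Euler: \( e^{i\pi} + 1 = 0 \) and [tex]a^2 + b^2 = c^2[/tex]. "
    r"Also \[ \int_0^1 x\,dx = \frac{1}{2} \]"
)
print(unify_math_delimiters(sample))
```

A rule-based pass like this breaks as soon as a site uses an unlisted delimiter or embeds equations as images, which is the kind of brittleness the caption attributes to standard extraction pipelines.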