Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Rabeeh Karimi Mahabadi; Sanjeev Satheesh; Shrimai Prabhumoye; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

TL;DR

The paper introduces Nemotron-CC-Math, a 133B-token-scale high-quality math corpus built from Common Crawl via a robust, domain-agnostic extraction pipeline. It combines layout-aware Lynx rendering with an LLM-based cleaning stage to preserve equations and code, followed by quality filtering, deduplication, and rigorous decontamination. The authors show that pretraining on Nemotron-CC-Math-4+ and Nemotron-CC-Math-3+ yields substantial improvements across math, code, and general knowledge benchmarks, outperforming prior open math datasets and scaling with data. They also demonstrate an efficient, smaller-model approach for boilerplate removal and provide thorough analyses of data composition, domain distribution, and topic coverage. The work offers a practical, open-source framework for generating high-fidelity, math-rich pretraining data applicable to other technical domains.

Abstract

Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
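The later pipeline stages named in the TL;DR (deduplication and decontamination) can be illustrated with a toy sketch. This is not the authors' implementation: the section does not say which algorithms they use, so exact-match hashing and word n-gram overlap are stand-in techniques chosen for brevity, and the function names `normalize`, `word_ngrams`, `deduplicate`, and `decontaminate`, as well as the sample documents, are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(docs):
    """Keep the first occurrence of each normalized document (exact match only)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def decontaminate(docs, benchmark_items, n=8):
    """Drop documents that share any word n-gram with a benchmark item."""
    banned = set()
    for item in benchmark_items:
        banned |= word_ngrams(item, n)
    return [doc for doc in docs if not (word_ngrams(doc, n) & banned)]


# Hypothetical sample data, for illustration only.
docs = [
    "Solve $x^2 - 5x + 6 = 0$ by factoring into $(x-2)(x-3)$.",
    "Solve  $x^2 - 5x + 6 = 0$ by factoring into $(x-2)(x-3)$.",  # whitespace-only variant
    "What is the sum of the first 100 positive integers? The answer is 5050.",
    "The derivative of $\\sin x$ is $\\cos x$.",
]
benchmark = ["What is the sum of the first 100 positive integers?"]

# Assumed ordering: deduplicate first, then remove benchmark overlap.
cleaned = decontaminate(deduplicate(docs), benchmark, n=5)
```

Here the whitespace-only duplicate is dropped by `deduplicate`, and the document that repeats the benchmark question is dropped by `decontaminate`. A web-scale pipeline would more plausibly need fuzzy matching rather than exact hashes, which this sketch deliberately omits.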
