The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray

TL;DR

The paper demonstrates that training language models on public domain and openly licensed text is feasible at scale. By assembling the 8TB Common Pile v0.1 from 30 sources and carefully filtering, deduplicating, and mixing the data, the authors train 7B-parameter Comma models on 1T and 2T tokens that are competitive with, and on several benchmarks outperform, budget-matched models trained on unlicensed data. They document their licensing principles, data provenance, and preprocessing, and release the dataset, training mixtures, and model checkpoints to support open, auditable research. The results suggest that open-license pretraining can yield competitive performance, supporting a shift toward more ethical and transparent AI development. The work also highlights ongoing challenges, such as domain coverage for commonsense tasks and the need for larger openly licensed corpora in the future.

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

Paper Structure

This paper contains 83 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The Common Pile is an 8TB dataset of openly licensed text curated from 30 diverse sources. The sources comprising the Common Pile are shown above, categorized by textual domain.
  • Figure 2: The Common Pile consistently outperforms other openly licensed corpora as a pre-training dataset. Following the setup from Penedo et al. (2024), we train and evaluate 1.7B parameter models on 28B tokens of data from each dataset. Stars denote benchmarks on which the model trained using the Common Pile outperforms all other models.
  • Figure 3: Compared to models trained with similar resources (7 billion parameters, 1 trillion tokens), Comma v0.1-1T is the strongest model on several standard benchmarks. To contextualize these results, we include Qwen3 8B (trained on 36 trillion tokens) as a "current best-practices" upper bound. Stars denote benchmarks on which Comma v0.1-1T outperforms all other compute-matched models (i.e., all models other than Qwen3). Full numerical results are provided in the benchmark results table in the appendix.
  • Figure 4: Comma v0.1-2T is also competitive with budget-matched models (7 billion parameters, 2 trillion tokens) trained on unlicensed data. We additionally include Qwen3 8B as a higher-budget upper bound. Stars denote benchmarks where Comma v0.1-2T outperforms budget-matched models. Full numerical results are provided in the 2T benchmark results table in the appendix.
  • Figure 5: Author contributions to this work. Large squares indicate a major contribution and small squares indicate a supporting contribution.
  • ...and 2 more figures