RedPajama: an Open Dataset for Training Large Language Models

Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, Ce Zhang

TL;DR

The paper introduces RedPajama, a pair of open pretraining data resources for large language models: RPv1 (RedPajama-V1), an open reproduction of the LLaMA training corpus, and RPv2 (RedPajama-V2), a massive web-only dataset annotated with quality signals to support principled data filtering. Alongside RPv1, the paper describes the RedPajama-INCITE models, which were trained on the Summit supercomputer, and gives a detailed account of the infrastructure challenges, the training regimen, and performance relative to LLaMA at multiple scales. RPv2 consists solely of web data, with 46 per-document quality signals and deduplication; the paper reports dataset statistics and ablation studies showing how filtering rules and signals affect downstream performance on a diverse benchmark suite. The work emphasizes transparency, scale, and versatility, releases extensive artifacts to facilitate dataset curation, and outlines directions for future open-model development and responsible data practices.

Abstract

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
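
To make the idea of filtering by quality signals concrete, the following is a minimal sketch of rule-based filtering over per-document signals. It is an illustration under stated assumptions, not the pipeline used in the paper: the `Document` class, the signal names, and the thresholds in `RULES` are all hypothetical and were chosen only to show the mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Rule = Callable[[float], bool]

# Hypothetical signal names and thresholds, for illustration only.
# They are NOT the names or cutoffs used by RedPajama-V2.
RULES: Dict[str, Rule] = {
    "word_count": lambda v: 50 <= v <= 100_000,
    "mean_word_length": lambda v: 3.0 <= v <= 10.0,
    "frac_lines_end_with_ellipsis": lambda v: v <= 0.3,
}


@dataclass
class Document:
    """A raw document together with its precomputed quality signals."""
    text: str
    signals: Dict[str, float]


def passes(doc: Document, rules: Dict[str, Rule] = RULES) -> bool:
    """Return True only if every rule accepts the document.

    A document that lacks a required signal is rejected rather than
    silently kept.
    """
    return all(
        name in doc.signals and rule(doc.signals[name])
        for name, rule in rules.items()
    )


def filter_documents(
    docs: Iterable[Document], rules: Dict[str, Rule] = RULES
) -> List[Document]:
    """Keep the subset of documents that satisfy all rules."""
    return [doc for doc in docs if passes(doc, rules)]
```

Because the signals are stored alongside the raw text, a different subset can be curated by swapping the rule set without recomputing anything about the documents themselves.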

Paper Structure

This paper contains 35 sections, 8 figures, 23 tables.

Figures (8)

  • Figure 1: The ecosystem around the RedPajama datasets. RedPajama has provided pretraining data for multiple open-source LLMs, including OpenELM [mehta2024openelm], OLMo [groeneveld2024olmo], Snowflake's Arctic [SnowflakeArctic2023], and RedPajama-INCITE. SlimPajama is a cleaned and deduplicated version of RedPajama-V1.
  • Figure 2: RedPajama-INCITE-Base 3B results on a subset of lm-evaluation-harness. The tasks follow the selection used to evaluate Pythia [biderman2023pythia] and GPT-J [gpt-j].
  • Figure 3: Chronological count of documents for each CommonCrawl snapshot before and after deduplication. Deduplication is performed sequentially, starting from the most recent snapshot and proceeding to the oldest.
  • Figure 4: Histograms for the quality signals computed by the CCNet [wenzek2019ccnet] pipeline.
  • Figure 5: Histograms for ML-based quality signals.
  • ...and 3 more figures
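
The sequential deduplication order described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes exact-match deduplication on a SHA-1 digest of the document text, uses a plain Python set as a stand-in for whatever membership structure the real pipeline uses, and assumes snapshot identifiers sort chronologically as strings. The function name `dedup_snapshots` and the snapshot ids in the usage note are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def dedup_snapshots(snapshots: Dict[str, List[str]]) -> Dict[str, Tuple[int, int]]:
    """Deduplicate documents across snapshots, newest snapshot first.

    Returns {snapshot_id: (count_before, count_after)}. A document is kept
    only the first time its digest is seen, so a duplicate that occurs in
    several snapshots survives only in the most recent one.
    """
    seen: Set[str] = set()
    counts: Dict[str, Tuple[int, int]] = {}

    # Assumption: ids such as "2023-14" sort chronologically as strings,
    # so a reverse sort visits the most recent snapshot first.
    for snapshot_id in sorted(snapshots, reverse=True):
        docs = snapshots[snapshot_id]
        kept = 0
        for text in docs:
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept += 1
        counts[snapshot_id] = (len(docs), kept)

    return counts
```

For example, with `{"2022-05": ["a", "b"], "2023-14": ["b", "c", "c"]}` the newer snapshot is processed first and keeps "b" and "c" once each, so the older snapshot retains only "a". Processing in this order means older snapshots tend to lose more documents, since anything re-crawled later is attributed to the later snapshot.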