Datasets for Large Language Models: A Comprehensive Survey

Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin

TL;DR

This survey consolidates the landscape of LLM datasets along five dimensions (pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets) into a single reference. It catalogs 444 datasets across 8 language categories and 32 domains, detailing data types, licenses, and scales (more than 774.5 TB for pre-training corpora and more than 700M instances for the other datasets). The authors analyze preprocessing pipelines, domain coverage, data diversity, and evaluation methodologies, and they discuss bottlenecks and future directions for data selection, quality control, and ecosystem development. The work highlights gaps in multilingual coverage, domain-specific resources, and standardized evaluation frameworks, offering guidance for building a more cohesive, transparent data ecosystem for LLM research and deployment.

Abstract

This paper explores Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure, analogous to a root system that sustains and nurtures the development of LLMs. Consequently, the examination of these datasets has emerged as a critical research topic. To address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing dataset resources is provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: https://github.com/lmmlzn/Awesome-LLMs-Datasets.

Paper Structure (167 sections, 21 figures, 30 tables)

Figures (21)

  • Figure 1: The overall architecture of the survey. Zoom in for better view
  • Figure 2: A timeline of some representative LLM datasets. Orange represents pre-training corpora, yellow represents instruction fine-tuning datasets, green represents preference datasets, and pink represents evaluation datasets
  • Figure 3: Data categories of the general pre-training corpora
  • Figure 4: Classification of books. Categorizing books into 30 fine-grained classes based on different domains
  • Figure 5: Pie charts depicting the data type distribution of selected multi-category pre-training corpora. The corresponding pre-training corpus names are positioned above each pie chart. Different colors represent distinct data types
  • ...and 16 more figures