Datasets for Large Language Models: A Comprehensive Survey

Yang Liu; Jiahuan Cao; Chongyu Liu; Kai Ding; Lianwen Jin

TL;DR

This survey consolidates the landscape of LLM datasets into a single reference organized along five dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. It catalogs 444 datasets across 8 language categories and 32 domains, detailing data types, licenses, and scales (more than 774.5 TB for pre-training corpora; 700M instances for other datasets). The authors analyze preprocessing pipelines, domain coverage, data diversity, and evaluation methodologies, and they discuss bottlenecks and future directions for data selection, quality control, and ecosystem development. The work highlights gaps in multilingual coverage, domain-specific resources, and standardized evaluation frameworks, offering concrete guidance for building a more cohesive, transparent data ecosystem for LLM research and deployment.

Abstract

This paper explores Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as foundational infrastructure, analogous to a root system that sustains and nurtures the development of LLMs. Consequently, the examination of these datasets is a critical research topic. To address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insight into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. It also provides a comprehensive review of available dataset resources, including statistics from 444 datasets covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: https://github.com/lmmlzn/Awesome-LLMs-Datasets.

Paper Structure (167 sections, 21 figures, 30 tables)

Introduction
Pre-training Corpora
General Pre-training Corpora
Webpages
Language Texts
Books
Academic Materials
Code
Parallel Corpus
Social Media
Encyclopedia
Multi-category Corpora
Domain-specific Pre-training Corpora
Financial Domain
Medical Domain
...and 152 more sections

Figures (21)

Figure 1: The overall architecture of the survey. Zoom in for better view
Figure 2: A timeline of some representative LLM datasets. Orange represents pre-training corpora, yellow represents instruction fine-tuning datasets, green represents preference datasets, and pink represents evaluation datasets
Figure 3: Data categories of the general pre-training corpora
Figure 4: Classification of books. Categorizing books into 30 fine-grained classes based on different domains
Figure 5: Pie charts depicting the data type distribution of selected multi-category pre-training corpora. The corresponding pre-training corpus names are positioned above each pie chart. Different colors represent distinct data types
...and 16 more figures