Data Management For Training Large Language Models: A Survey

Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu

TL;DR

This survey addresses how data management shapes the pretraining and supervised fine-tuning of large language models, detailing domain composition, data quantity, and data quality strategies, as well as their interactions. It synthesizes evidence on scaling laws, deduplication, toxicity filtering, and instruction design, and discusses dynamic data-efficient learning and multi-task vs single-task approaches in SFT. The authors highlight challenges such as debiasing, multimodal data management, data synthesis, and the need for a general, autonomous data management framework. The work provides guidance for practitioners aiming to build powerful LLMs through efficient, principled data management and identifies promising directions for future research.

Abstract

Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly the formulation of a well-suited training dataset, is significant for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanisms of current prominent practices are still unknown. Consequently, the exploration of data management has attracted growing attention in the research community. This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through efficient data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.

Paper Structure (41 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Data management pipelines for the pretraining and supervised fine-tuning of Large Language Models.
  • Figure 2: The domain composition of prominent Large Language Models.
  • Figure 3: Taxonomy of research in data management for pretraining and supervised fine-tuning of Large Language Models (LLM).