A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

Junyu Luo, Bohan Wu, Xiao Luo, Zhiping Xiao, Yiqiao Jin, Rong-Cheng Tu, Nan Yin, Yifan Wang, Jingyang Yuan, Wei Ju, Ming Zhang

TL;DR

This paper addresses the data-efficiency challenge in post-training for large language models by advocating a data-centric perspective. It introduces the data value flywheel and a five-part taxonomy—Data Selection, Data Quality Enhancement, Synthetic Data Generation, Data Distillation and Compression, and Self-Evolving Data Ecosystem—and systematically surveys representative methods in each category. The authors analyze benefits, trade-offs, and open problems, offering directions such as meta-learning for selection, domain-specific data synthesis, and unified self-evolving frameworks. The work aims to guide researchers and practitioners toward data-efficient, robust LLM post-training with reduced annotation costs and improved generalization across domains.

Abstract

Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high cost of manual annotation and diminishing marginal returns as data scale increases. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM

Paper Structure

This paper contains 38 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Illustration of the data flywheel in Data-Efficient LLM Post-Training, depicting the iterative cycle of data selection, data quality enhancement, synthetic data generation, knowledge distillation, and self-evolving data ecosystems to maximize model performance with minimal data requirements.
  • Figure 2: A taxonomy of Data-Efficient LLM Post-Training.
  • Figure 3: Overview of four major data selection approach categories: static filtering, dynamic selection, agent strategy, and labeling efficiency.
  • Figure 4: Three key approaches for data quality enhancement in LLM post-training: semantic rewriting for diversity, toxicity control for safety, and distribution stabilization for balanced representation.
  • Figure 5: Three main approaches for data generation in LLM post-training: instruction-driven generation for creating instruction-response pairs, knowledge-guided generation using structured knowledge, and adversarial generation for testing model robustness.
  • ...and 4 more figures