Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
TL;DR
This survey organizes data assessment and selection methods for instruction tuning of LLMs along three dimensions: quality, diversity, and importance. For quality, it catalogs hand-crafted and model-based indicators, GPT-based scoring, and human evaluation. For diversity, it covers lexical and semantic metrics as well as geometry-based coreset sampling. For importance, it reviews uncertainty-based, reward-based, and gradient/influence-based approaches, including bilevel optimization frameworks. The paper compares the methods' reported results, discusses hybrid strategies, and proposes avenues such as proxy models and scalable clustering for handling ever-larger datasets and models. It identifies four open challenges that also frame its future directions: aligning benchmarks with selection goals and keeping them sound, agreeing on a unified definition of good data, scaling pipelines, and validating fairness. The findings suggest that combining quality, diversity, and importance often yields the best trade-offs, which matters in practice when constructing high-quality instruction datasets under budget constraints. Overall, the survey offers a structured blueprint for designing robust, scalable data assessment and selection pipelines that improve the efficiency and effectiveness of instruction tuning.
Abstract
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Despite the vast number of open instruction datasets, naively training an LLM on all existing instructions may be neither optimal nor practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, in the context of instruction tuning, there is still a gap in knowledge about which data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review of the existing literature on data assessment and selection, with a focus on instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones, structured as a unified, fine-grained taxonomy. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, the latest methods are compared on their officially reported results to support an in-depth discussion of their limitations. Finally, we summarize the open challenges and propose promising avenues for future studies. All related contents are available at https://github.com/yuleiqin/fantastic-data-engineering.
