Data-Centric AI in the Age of Large Language Models

Xinyi Xu; Zhaoxuan Wu; Rui Qiao; Arun Verma; Yao Shu; Jingtan Wang; Xinyuan Niu; Zhenfeng He; Jiangwei Chen; Zijian Zhou; Gregory Kang Ruey Lau; Hieu Dao; Lucas Agussurja; Rachael Hwee Ling Sim; Xiaoqiang Lin; Wenyang Hu; Zhongxiang Dai; Pang Wei Koh; Bryan Kian Hsiang Low

TL;DR

This position paper reframes AI research around data, arguing that data quality, provenance, and contextual information are central to the development and use of large language models (LLMs). It proposes four data-centered scenarios (benchmarks and curation, data attribution, knowledge transfer, and inference contextualization) to drive data-centric benchmarking, open research, cost-effective distillation, and personalized inference. By advocating DataComp-inspired benchmarks, data attribution and unlearning, synthesized data for distillation, and retrieval and in-context strategies for personalization, the work highlights practical paths to safer, democratized, and more adaptable LLM technologies. The paper emphasizes multi-layered safety, openness, and accessibility, while acknowledging limitations in integration, unlearning guarantees, and governance when scaling data-centric approaches.

Abstract

This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in both the developmental stages (e.g., pretraining and fine-tuning) and the inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionately little attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and to document research efforts and results, which can help promote openness and transparency in AI and LLM research.

Paper Structure (32 sections, 1 figure)

This paper contains 32 sections, 1 figure.

Introduction
Benchmarks and curation for training data.
Data attribution.
Knowledge transfer.
Inference contextualization with data.
Rigorous Data-centric Benchmarks
Research Directions: Benchmarks and Data Curation
Benchmarks for heterogeneous and large-scale pretraining data.
Benchmarks for adapting to downstream domains and tasks.
Dataset design and curation.
Impact: Data-centric Open LLM Research
Data Attribution
Research Directions: Data Attribution and Unlearning
Data attribution.
Unlearning of data.
...and 17 more sections

Figures (1)

Figure 1: The section on data-centric benchmarks (indexed 1 in the figure) underscores the importance of the training data (for both pretraining and fine-tuning) and the data curation techniques. The section on data attribution (indexed 2 in the figure) highlights that the LLMs' outputs depend on the training data. The section on knowledge transfer (indexed 3 in the figure) describes the "knowledge" of the LLMs to be transferred from some training data. The section on inference contextualization (indexed 4 in the figure) demonstrates the usage of data by the LLMs at inference (i.e., in response to a query).
