Understanding the Dataset Practitioners Behind Large Language Model Development
Crystal Qian, Emily Reif, Minsuk Kahng
TL;DR
The paper defines the role of the dataset practitioner and investigates practitioners' workflows in large language model development through a retrospective analysis of teams at Google and semi-structured interviews (N=10). It finds that data quality is a top priority but is defined inconsistently, so practitioners validate data by intuition or with custom analyses and make limited use of existing tooling. The authors propose two explanations for the tooling gap (a nascent, still-evolving field and diverse, use-case-specific needs) and highlight opportunities to standardize metrics and develop flexible tooling so that practitioners can align on how data quality is defined and evaluated. These insights inform human–computer interaction research and tooling development aimed at improving qualitative data exploration in unstructured data pipelines.
Abstract
As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.
