Understanding the Dataset Practitioners Behind Large Language Model Development

Crystal Qian, Emily Reif, Minsuk Kahng

TL;DR

The paper defines the role of the dataset practitioner and investigates practitioners' workflows in large language model development through a retrospective analysis of teams at Google and semi-structured interviews (N=10). It finds that data quality is a top priority but lacks a consistent definition, which leads to intuition-driven validation and reliance on custom analyses, with limited uptake of existing tooling. The authors propose two explanations for the tooling gap (a nascent, still-evolving field and diverse, use-case-specific needs) and highlight opportunities to standardize metrics and develop flexible tooling to improve alignment. These insights inform human–computer interaction research and tooling development aimed at improving qualitative data exploration in unstructured data pipelines.

Abstract

As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.

Paper Structure

This paper contains 20 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 2: This matrix categorizes our findings (inspired by Kandel et al.). An 'x' in a cell indicates that a participant mentioned that specific topic in their interview. Topics are grouped by Processes, Tools, and Challenges, and participants are grouped by their domain, as listed in the participants table. All participants mentioned interacting with spreadsheets and cited data quality as a challenge in their work.