The Evolution of LLM Adoption in Industry Data Curation Practices

Crystal Qian, Michael Xieyang Liu, Emily Reif, Grady Simon, Nada Hussein, Nathan Clement, James Wexler, Carrie J. Cai, Michael Terry, Minsuk Kahng

TL;DR

This study investigates how data practitioners in a large technology organization adopt LLMs for unstructured data curation, using a three-stage program: an exploratory survey ($N=84$), expert interviews ($N=10$), and a user study with two LLM-based design probes ($N=12$). The findings reveal a shift from heuristic, bottom-up data understanding to insights-first, top-down workflows supported by LLMs, complemented by the emergence of silver and super-golden datasets to improve labeling and evaluation. While early adoption was limited, the design-probe studies show potential productivity gains and broad appeal for spreadsheet- and notebook-integrated LLM tooling, alongside barriers related to reliability, scale, and unfamiliarity with new features. Collectively, the work points to a paradigm where LLMs augment data practitioners’ workflows, enabling more targeted, high-quality data curation and suggesting future directions in tool development, governance, and multimodal data handling.

Abstract

As large language models (LLMs) grow increasingly adept at processing unstructured text data, they offer new opportunities to enhance data curation workflows. This paper explores the evolution of LLM adoption among practitioners at a large technology company, evaluating the impact of LLMs in data curation tasks through participants' perceptions, integration strategies, and reported usage scenarios. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution. In Q2 2023, we conducted a survey to assess LLM adoption in industry for development tasks (N=84), and in Q3 2023 we facilitated expert interviews to assess evolving data needs (N=10). In Q2 2024, we explored practitioners' current and anticipated LLM usage through a user study involving two LLM-based prototypes (N=12). While each study addressed distinct research goals, taken together they revealed a broader narrative about evolving LLM usage. We discovered an emerging shift in data understanding from heuristic-first, bottom-up approaches to insights-first, top-down workflows supported by LLMs. Furthermore, to respond to a more complex data landscape, data practitioners now supplement traditional subject-expert-created 'golden datasets' with LLM-generated 'silver' datasets and rigorously validated 'super golden' datasets curated by diverse experts. This research sheds light on the transformative role of LLMs in large-scale analysis of unstructured data and highlights opportunities for further tool development.

Paper Structure

This paper contains 56 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The tabular LLM-based prompting interface within the spreadsheet design probe. The cells in column A contain prompts (i.e., questions that crowd users asked AI agents) from the Chatbot Arena Conversation Dataset [lmsys_chatbot_2024]. The header of the second column (cells B1 to B3) contains an instruction that users of the probe can specify. The remaining cells in that column are automatically populated with LLM outputs, generated by running an LLM query that combines the instruction from the header with the corresponding data in column A (e.g., =RUN_PROMPT(CONCATENATE(B1, B2, B3, A8))). Column C shows another prompt.
  • Figure 2: The tabular LLM-based prompting interface within the notebook design probe. This example shows a new tone column added to a dataframe, which asks "What is the tone of this text?" on the prompt column. Outputs are not constrained. The output dataframe with the new tone column is displayed below the form.
  • Figure 3: The summative LLM-based prompting interface within the notebook design probe. The example illustrates querying "What is this dataset about?" for the prompt column of a dataframe.
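
The two interaction patterns described in these captions, tabular prompting (one LLM query per row, as in Figures 1 and 2) and summative prompting (one query over a whole column, as in Figure 3), can be illustrated with a minimal Python sketch. This is not the probes' implementation: the function names `build_prompt`, `add_llm_column`, and `summarize_column` are hypothetical, joining cells with newlines is an assumption (a spreadsheet CONCATENATE adds no separator), and `fake_run_prompt` is a stand-in for a real LLM call.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def build_prompt(instruction_cells: List[str], value: str) -> str:
    """Join the header instruction cells and one data cell into a single prompt.

    Assumption: cells are separated by newlines for readability.
    """
    return "\n".join([*instruction_cells, value])


def add_llm_column(
    rows: List[Row],
    source: str,
    target: str,
    instruction_cells: List[str],
    run_prompt: Callable[[str], str],
) -> List[Row]:
    """Tabular prompting: return copies of the rows with a new `target` column,
    filled by running one prompt per row over the `source` column."""
    return [
        {**row, target: run_prompt(build_prompt(instruction_cells, row[source]))}
        for row in rows
    ]


def summarize_column(
    rows: List[Row],
    source: str,
    question: str,
    run_prompt: Callable[[str], str],
) -> str:
    """Summative prompting: ask one question about the whole `source` column."""
    joined = "\n".join(f"- {row[source]}" for row in rows)
    return run_prompt(f"{question}\n{joined}")


def fake_run_prompt(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; it only reports prompt length."""
    return f"<LLM output for {len(prompt.splitlines())}-line prompt>"


# Example data is made up for illustration only.
rows = [
    {"prompt": "How do I bake bread?"},
    {"prompt": "Write a haiku about rain."},
]

labeled = add_llm_column(
    rows, "prompt", "tone", ["What is the tone of this text?"], fake_run_prompt
)
summary = summarize_column(
    rows, "prompt", "What is this dataset about?", fake_run_prompt
)
```

In this sketch the original rows are left unchanged and the new column is returned in copies, which mirrors the captions' description of a derived output column displayed alongside the source data. Swapping `fake_run_prompt` for a real model client would be the only change needed to try the pattern on actual data.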