Prompting Datasets: Data Discovery with Conversational Agents

Johanna Walker; Elisavet Koutsiana; Joe Massey; Gefion Thuermer; Elena Simperl

TL;DR

This work investigates whether conversational generative AI (CGAI) can support data discovery across both the search and sensemaking phases. In three user workshops using GPT-3.5, GPT-4, and Bard in English and German contexts, the study evaluates CGAI’s ability to suggest datasets, explain its rationale, and provide supporting data documentation and code. Findings show that CGAIs can propose relevant data and assist with analysis and usage guidance, but are hampered by hallucinations, inconsistent outputs, and limited live data access, underscoring the need for external verification and improved prompting. The authors contribute a prompting-oriented interaction model and guidelines for data prompting, highlighting CGAI’s potential to augment data literacy and workflow integration while emphasizing reliability as a key area for future work. The work has practical implications for designing end-to-end data-discovery tools that use CGAI responsibly, with attention to provenance, licensing, and multilingual support.

Abstract

Can large language models assist in data discovery? Data discovery predominantly happens via search on a data portal or the web, followed by assessment of the dataset to ensure it is fit for the intended purpose. The ability of conversational generative AI (CGAI) to support recommendations with reasoning implies that it can suggest datasets to users, explain why it has done so, and provide documentation-like information about the dataset to support a use decision. We hold three workshops with data users and find that, despite limitations around web capabilities, CGAIs are able to suggest relevant datasets and support many of the required sensemaking activities, as well as dataset analysis and manipulation. However, CGAIs may also suggest fictional datasets and perform inaccurate analysis. We identify emerging practices in data discovery and present a model of these to inform future research directions and data prompt design.

Paper Structure (43 sections, 9 figures, 11 tables)

Introduction
Background Literature
Dataset discovery
CGAI for information retrieval
Summary
Methodology
Data collection
Data preparation and analysis
Ethics
Findings
RQ1: How do CGAI compare with state-of-the-art dataset discovery technology?
Success rates of conversational dataset search
Query and chat format
Results of data requests
Self-assessment of efficacy
...and 28 more sections

Figures (9)

Figure 1: Screenshots of a data discovery journey showing (a) dataset search on Google Data Search; (b) previewing the data on the data.gov.uk website; (c) opening the data in Excel; (d) data visualisation in Excel; and (e) evaluation information search on Google.
Figure 2: Koesten et al.'s (2017) framework for interaction with structured data, showing the five steps of the data journey for users of secondary data.
Figure 3: Results from the questionnaires regarding previously used methods (multiple answers possible).
Figure 4: Results from the questionnaires regarding familiarity with the conversational agents.
Figure 5: Results from the questionnaires completed after the chat with the conversational agents, regarding (a) how useful the agent's support was and (b) how successful the outcome was.
...and 4 more figures
