From keywords to semantics: Perceptions of large language models in data discovery
Maura E. Halstead, Mark A. Green, Caroline Jay, Richard Kingston, David Topping, Alexander Singleton
TL;DR
This study investigates researchers' acceptance of large language models (LLMs) for data discovery, drawing on focus groups with 27 participants across four user types. A three-theme conceptual model emerges: LLMs have the potential to transform the research process, but significant barriers (bias, unreliability, ethics, and trust) stand in the way; transparency about data, models, and responses can overcome these barriers and foster adoption. The findings underscore that LLMs should augment, not replace, human judgment, and they provide concrete design implications for enhancing transparency and trust in AI-assisted data discovery. The work advances human-centered AI in data ecosystems by outlining actionable requirements for trustworthy, explainable, and inclusive LLM-enabled search tools.
Abstract
Current approaches to data discovery match keywords between metadata and queries. This matching requires researchers to know the exact wording that other researchers previously used, which makes the process challenging and can cause relevant data to be missed. Large language models (LLMs) could enhance data discovery by removing this requirement and allowing researchers to ask questions in natural language. However, it is not yet known whether researchers would accept LLMs for data discovery. Taking a human-centered artificial intelligence (HCAI) perspective, we ran focus groups (N = 27) to understand researchers' views on LLMs for data discovery. Our conceptual model shows that the potential benefits alone are not enough for researchers to use LLMs instead of current technology. Barriers prevent researchers from fully accepting LLMs, but features that provide transparency could overcome them. Developers can use our model to incorporate features that increase acceptance of LLMs for data discovery.
