Towards Enabling FAIR Dataspaces Using Large Language Models
Benedikt T. Arnold, Johannes Theissen-Lipp, Diego Collarana, Christoph Lange, Sandra Geisler, Edward Curry, Stefan Decker
TL;DR
The paper addresses the adoption barrier that Semantic Web complexity poses for FAIR dataspaces and proposes Large Language Models (LLMs) as a means to lower costs and ease adoption. It presents a concrete proof of concept in which GPT-4 assists with FAIR dataspace tasks: extending SHACL-based metadata, leveraging GND for cultural data, and generating dataset instances and usage policies. It also discusses provenance and limitations. A research agenda outlines six focus areas (interactive vs. automated systems, adaptation, knowledge integration, open models for data sovereignty, efficiency, and safety) for studying LLMs in dataspaces systematically. The work highlights the potential of LLMs to enhance FAIRness and interoperability in dataspaces, while underscoring the need for provenance, guardrails, and open, sovereign model options to ensure practical, trustworthy deployment.
Abstract
Dataspaces have recently gained adoption across various sectors, including traditionally less digitized domains such as culture. Leveraging Semantic Web technologies helps to make dataspaces FAIR, but the complexity of these technologies poses a significant challenge to the adoption of dataspaces and increases their cost. The advent of Large Language Models (LLMs) raises the question of how these models can support the adoption of FAIR dataspaces. In this work, we demonstrate the potential of LLMs in dataspaces with a concrete example. We also derive a research agenda for exploring this emerging field.
