Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
TL;DR
This work tackles the challenge of unstructured dataset documentation by introducing a retrieval-augmented prompting approach that uses large language models to extract key dimensions and produce machine-readable metadata. The method relies on data preprocessing, diverse prompt types, and chained workflows to systematically recover information on use, provenance, licensing, data composition, gathering, annotation, and social concerns. Evaluated on 12 data papers from Scientific Data and Data in Brief using GPT-3.5 and Flan-UL2, the approach achieves high overall accuracy with GPT-3.5 (~81%) and reveals a notable drop with Flan-UL2 (~69%), while also diagnosing hallucination patterns. The authors release an open-source tool, dataDocAnalyzer, and a replication package, enabling practitioners to automate dataset documentation analysis, improve regulatory compliance checks, and boost dataset discoverability.
Abstract
Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as the provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering its automated analysis and processing. In this work, we explore using large language models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could aid data publishers and practitioners in creating machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them. In this paper, we evaluate the approach on 12 scientific dataset papers published in two scientific journals (Nature's Scientific Data and Elsevier's Data in Brief) using two different LLMs (GPT-3.5 and Flan-UL2). Results show good accuracy with our prompt extraction strategies. Concrete results vary depending on the dimension, but overall, GPT-3.5 shows higher accuracy (81.21%) than Flan-UL2 (69.13%), although it is more prone to hallucinations. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results, in an open-source repository.
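To make the idea of chained, retrieval-backed prompting concrete, the following is a minimal sketch and not the released tool's implementation. It assumes a two-step chain per dimension (first ask whether the retrieved context answers the question, then ask for the answer) and uses simple word overlap as a stand-in for a real retriever. All names here (`DIMENSIONS`, `retrieve`, `extract`, `fake_llm`, `PASSAGES`) are hypothetical, the prompts are illustrative rather than the paper's, and `fake_llm` is a toy replacement for an actual model call.

```python
import json
import re
from typing import Callable, Dict, List, Optional

# Hypothetical dimensions and questions, loosely following the abstract.
DIMENSIONS: Dict[str, str] = {
    "provenance": "Who collected the data and how was it gathered?",
    "license": "Under which license is the dataset distributed?",
}

# Toy passages standing in for a preprocessed data paper.
PASSAGES: List[str] = [
    "The images were taken in 2019.",
    "The dataset is distributed under the CC BY 4.0 license.",
    "Each record has 12 columns.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(passages: List[str], question: str, k: int = 2) -> List[str]:
    """Return the k passages sharing the most words with the question.

    Assumption: word overlap replaces the embedding search a real system would use.
    """
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def extract(passages: List[str], llm: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Run a two-step prompt chain per dimension and collect the answers."""
    metadata: Dict[str, Optional[str]] = {}
    for name, question in DIMENSIONS.items():
        context = "\n".join(retrieve(passages, question))
        # Step 1: check that the context covers the question, to limit hallucinated answers.
        check = llm(f"Context:\n{context}\n\nDoes the context answer: {question} Reply yes or no.")
        if check.strip().lower().startswith("yes"):
            # Step 2: only then ask for the actual value.
            metadata[name] = llm(f"Context:\n{context}\n\nAnswer concisely: {question}").strip()
        else:
            metadata[name] = None
    return metadata


def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real model: it only 'knows' about licensing."""
    context, _, task = prompt.partition("\n\n")
    known = "license" in task.lower() and "CC BY 4.0" in context
    if "Reply yes or no" in task:
        return "yes" if known else "no"
    return "CC BY 4.0" if known else "unknown"


if __name__ == "__main__":
    # Emit the extracted dimensions as machine-readable JSON.
    print(json.dumps(extract(PASSAGES, fake_llm), indent=2))
```

In practice, `fake_llm` would be replaced by a call to whichever model is under evaluation. Dimensions the chain cannot ground in the text are left empty instead of being guessed.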
