AutoDDG: Automated Dataset Description Generation using Large Language Models
Haoxiang Zhang, Yurong Liu, Aécio Santos, Wei-Lun Hung, Juliana Freire
TL;DR
This work addresses dataset discoverability by automatically generating high-quality descriptions for tabular data. It introduces AutoDDG, a two-stage framework that combines data-driven content and semantic profiling with an LLM-based description engine to produce both reader-friendly (UFD) and search-optimized (SFD) descriptions. The authors establish two new benchmarks (ECIR-DDG and NTCIR-DDG) and a comprehensive evaluation methodology, and report substantial gains in retrieval performance (up to roughly 30% improvement in NDCG) along with high-quality descriptions grounded in dataset content. They also analyze cost, scalability, and component contributions, presenting AutoDDG as a practical, scalable way to improve dataset findability in large collections and data platforms.
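For readers unfamiliar with the retrieval metric cited above, the following is a minimal sketch of NDCG at a cutoff k. It assumes graded relevance labels and the linear-gain variant of the formula; the summary does not state which cutoff or gain function the authors use, so the function names and parameters here are illustrative only.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    # Rank 0 is the top result; the discount is log2(position + 1) with 1-based positions.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    Assumption: `relevances` holds the labels of all judged datasets for the
    query, in the order the search system ranked them.
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists the most relevant datasets first scores 1.0, and pushing a relevant dataset down the list lowers the score. A relative improvement in NDCG therefore means relevant datasets appear nearer the top of the results.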
Abstract
The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. Widely used dataset search systems rely on keyword search over dataset metadata, including descriptions, to support discovery. Therefore, when these descriptions are incomplete, missing, or inconsistent with dataset contents, findability is severely compromised. To improve findability, we introduce AutoDDG, a framework that automatically generates descriptions of tabular data. By adopting a data-driven approach to summarize dataset contents and leveraging large language models (LLMs) to enrich summaries with semantic information and produce human-readable text, AutoDDG derives descriptions that are comprehensive, accurate, readable, and concise. A critical challenge in this problem is evaluating the effectiveness of description generation methods and assessing the quality of the generated descriptions. We propose a comprehensive evaluation methodology that combines retrieval, reference-based, and reference-free assessment, with human validation. Our experimental results using new benchmarks demonstrate that AutoDDG generates high-quality, accurate descriptions at scale, significantly improving dataset retrieval performance across diverse use cases. AutoDDG is publicly available at https://github.com/VIDA-NYU/AutoDDG.
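The data-driven idea described in the abstract can be illustrated with a small sketch: compute a compact profile of a table, then turn it into a prompt that grounds an LLM in the dataset's actual contents. This is not the AutoDDG API. The function names (`profile_table`, `build_prompt`), the profile fields, and the prompt wording are all hypothetical, and the LLM call itself is omitted.

```python
import csv
import io
from collections import Counter


def profile_table(csv_text, top_k=3):
    """Summarize each column: missing count, distinct count, and a range or top values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    profile = {}
    for col in reader.fieldnames or []:
        values = [r[col] for r in rows if r[col] not in ("", None)]
        summary = {"missing": len(rows) - len(values), "distinct": len(set(values))}
        try:
            nums = [float(v) for v in values]
        except ValueError:
            nums = None
        if values and nums is not None:
            # Every non-missing value parsed as a number: report the range.
            summary.update(type="numeric", min=min(nums), max=max(nums))
        else:
            # Otherwise treat the column as text and keep the most frequent values.
            top = [v for v, _ in Counter(values).most_common(top_k)]
            summary.update(type="text", top_values=top)
        profile[col] = summary
    return profile


def build_prompt(name, profile):
    """Render the profile as a prompt (hypothetical wording, not the authors' prompt)."""
    lines = [
        f"Describe the tabular dataset '{name}' in one short paragraph.",
        "Use only the facts in this profile:",
    ]
    for col, s in profile.items():
        if s["type"] == "numeric":
            detail = f"numeric, range {s['min']} to {s['max']}"
        else:
            detail = "text, e.g. " + ", ".join(s["top_values"])
        lines.append(f"- {col}: {detail}; {s['missing']} missing, {s['distinct']} distinct")
    return "\n".join(lines)


SAMPLE = "city,population\nAustin,950000\nBoston,\nAustin,960000\n"
PROMPT = build_prompt("city_population", profile_table(SAMPLE))
```

Because the prompt contains only statistics computed from the table, any description the model produces can be checked against the profile. That is one plausible way to keep generated text consistent with dataset contents.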
