AutoDDG: Automated Dataset Description Generation using Large Language Models

Haoxiang Zhang, Yurong Liu, Aécio Santos, Wei-Lun Hung, Juliana Freire

TL;DR

This work tackles the challenge of dataset discoverability by automatically generating high-quality descriptions for tabular data. It introduces AutoDDG, a two-stage framework combining data-driven content and semantic profiling with an LLM-based description engine to produce both reader-friendly (UFD) and search-optimized (SFD) descriptions. The authors establish new benchmarks (ECIR-DDG and NTCIR-DDG) and a comprehensive evaluation methodology, showing substantial gains in retrieval performance (up to ~30% improvement in NDCG) and high-quality descriptions grounded in dataset content. They also analyze cost, scalability, and component contributions, demonstrating AutoDDG as a practical, scalable solution for improving dataset findability in large collections and data platforms.
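
The retrieval gains above are reported in NDCG (normalized discounted cumulative gain). As a rough illustration of what that metric measures, the sketch below computes NDCG for two rankings. It assumes the common linear-gain, log2-discount form; the exact variant and cutoff used in the paper are not stated here, and the relevance grades are made-up values for illustration only.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each graded relevance is discounted
    # by log2(rank + 1), so relevant results near the top count more.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    # Normalize by the DCG of the ideal (descending) ordering of the same grades.
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades for the top-4 results of one query,
# before and after descriptions are improved (not data from the paper).
before = [0, 1, 0, 2]
after = [2, 1, 0, 0]
print(f"NDCG before: {ndcg(before):.3f}, after: {ndcg(after):.3f}")
```

A ranking that places the most relevant datasets first scores 1.0; pushing them down the list lowers the score.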

Abstract

The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. Widely used dataset search systems rely on keyword search over dataset metadata, including descriptions, to support discovery. Therefore, when these descriptions are incomplete, missing, or inconsistent with dataset contents, findability is severely compromised. To improve findability, we introduce AutoDDG, a framework that automatically generates descriptions of tabular data. By adopting a data-driven approach to summarize dataset contents and leveraging large language models (LLMs) to enrich summaries with semantic information and produce human-readable text, AutoDDG derives descriptions that are comprehensive, accurate, readable, and concise. A critical challenge in this problem is evaluating the effectiveness of description generation methods and assessing the quality of the generated descriptions. We propose a comprehensive evaluation methodology that combines retrieval, reference-based, and reference-free assessment, with human validation. Our experimental results using new benchmarks demonstrate that AutoDDG generates high-quality, accurate descriptions at scale, significantly improving dataset retrieval performance across diverse use cases. AutoDDG is publicly available at https://github.com/VIDA-NYU/AutoDDG.

Paper Structure

This paper contains 23 sections, 1 equation, 12 figures, 11 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Dataset descriptions from NYC Open Data: (a) and (b) lack sufficient details, while (c) contains inconsistent information: the 2022 Yellow Taxi Trip Data dataset includes taxi trips that span multiple years, not just 2022. (d) shows a description automatically generated by AutoDDG for the dataset (a), while (e) was generated by an LLM using a data sample.
  • Figure 2: AutoDDG: a multi-stage framework for tabular dataset description generation. In the Context Preparation stage, a content and a semantic profiler derive summaries of the dataset. The LLM-powered Description Generation Engine uses the summaries to produce dataset descriptions. The system can generate descriptions tailored to specific needs.
  • Figure 3: Example summary created by the Content Profiler.
  • Figure 4: Examples of semantic information derived from datasets by the Semantic Profiler.
  • Figure 5: Prompt for generating concise dataset topics based on the dataset title, description, and sample data.
  • ...and 7 more figures
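
The Context Preparation stage named in the Figure 2 caption can be pictured with a small sketch: summarize each column from the data itself, then hand the summary to an LLM prompt. This is only an illustration under stated assumptions, not AutoDDG's implementation or API. The function names (`profile_column`, `profile_table`, `build_prompt`), the summary fields, the prompt wording, and the sample rows are all hypothetical, and the LLM call itself is left out.

```python
import csv
import io
from collections import Counter

def profile_column(values):
    """Summarize one column: missing count, distinct count, and range or top values."""
    present = [v for v in values if v.strip() != ""]
    summary = {"missing": len(values) - len(present), "distinct": len(set(present))}
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        numbers = None  # at least one non-numeric value: treat the column as text
    if numbers:
        summary.update(type="numeric", min=min(numbers), max=max(numbers))
    else:
        top = [v for v, _ in Counter(present).most_common(3)]
        summary.update(type="text", top_values=top)
    return summary

def profile_table(csv_text):
    """Build a per-column content summary from CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {name: profile_column([row[name] or "" for row in rows]) for name in columns}

def build_prompt(title, profile):
    """Assemble an illustrative prompt; the wording is an assumption, not the paper's prompt."""
    lines = [f"Dataset title: {title}", "Column summaries:"]
    lines += [f"- {name}: {summary}" for name, summary in profile.items()]
    lines.append("Write a concise, accurate description of this dataset.")
    return "\n".join(lines)

# Made-up sample rows, loosely modeled on a taxi trip table.
sample = (
    "pickup_date,borough,fare\n"
    "2022-01-03,Manhattan,12.5\n"
    "2021-12-30,Queens,\n"
    "2022-02-11,Manhattan,30\n"
)
profile = profile_table(sample)
print(build_prompt("Yellow Taxi Trip Data", profile))
```

Because the summary is computed from the rows rather than from the title, it would surface details such as trips dated outside 2022, the kind of inconsistency the Figure 1 caption points out.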