Making Sense of Data in the Wild: Data Analysis Automation at Scale

Mara Graziani, Malina Molnar, Irina Espejo Morales, Joris Cadow-Gossweiler, Teodoro Laino

TL;DR

The paper addresses the challenge of underutilized public data repositories amid an explosion of available datasets by proposing a data-scouter system that combines intelligent agents with Retrieval-Augmented Generation (RAG) to automate data analysis, curation, and indexing at scale. It introduces a multimodal analysis pipeline, license-aware data acquisition from Zenodo, RAG-based semantic indexing, and an interactive visualization for exploring dataset content, demonstrating improved description quality and retrieval diversity compared with baseline descriptions. Key findings include richer, less redundant dataset descriptions, competitive retrieval accuracy, and meaningful cross-repository alignment with HuggingFace datasets, alongside evidence that metadata-driven curation enhances downstream tasks such as realistic synthetic data generation. The work highlights practical implications for scalable data reuse and discovery, while noting storage and scalability challenges and proposing pathways to integrate the system into public repositories or batch-processing workflows to maximize impact.

Abstract

As the volume of publicly available data continues to grow, researchers face the challenge of limited diversity in benchmarking machine learning tasks. Although thousands of datasets are available in public repositories, the sheer abundance often complicates the search for suitable data, leaving many valuable datasets underexplored. The problem is compounded by the fact that, despite longstanding advocacy for improving data curation quality, current solutions remain prohibitively time-consuming and resource-intensive. In this paper, we propose a novel approach that combines intelligent agents with retrieval-augmented generation to automate data analysis, dataset curation, and indexing at scale. Our system leverages multiple agents to analyze raw, unstructured data across public repositories, generating dataset reports and interactive visual indexes that can be easily explored. We demonstrate that our approach results in more detailed dataset descriptions, higher hit rates, and greater diversity in dataset retrieval tasks. Additionally, we show that the dataset reports generated by our method can be leveraged by other machine learning models to improve performance on specific tasks, such as the accuracy and realism of synthetic data generation. By streamlining the process of transforming raw data into machine-learning-ready datasets, our approach enables researchers to better utilize existing data resources.

Paper Structure

This paper contains 31 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: A comparison of a user-provided description that lacks informativeness for data reuse with our generated descriptions: V1, created solely from data analysis results, and V3, which incorporates data analysis results, the paper, and data examples.
  • Figure 2: (a, b) Comparison of user-defined (original) and generated description lengths, measured in (a) number of characters and (b) number of tokens. (c) Distribution of pairwise similarities: original descriptions compared to their corresponding papers versus generated descriptions compared to the same papers.
  • Figure 3: (a) Hit rate against normalized retrieval entropy of RAG based on the descriptions generated by our approach (V3-Llama-8B-0.5T) and on the original descriptions. For both measures, higher is better. (b) Cosine similarity between pairs of vectors used by our system for the retrieval of popular HF dataset benchmarks. A value of $1$ corresponds to high similarity in the vector space, indicating that the two datasets are related, while a value of $0$ indicates poor similarity.
  • Figure 4: Qualitative comparison of synthetic data generated with or without our metadata curation. (a) Car Sales. (b) Iris Flowers.
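
The three quantities named in the Figure 3 caption (hit rate, normalized retrieval entropy, and cosine similarity) can be illustrated with a minimal sketch. The exact definitions used in the paper are not given in this section, so the formulas below are assumptions: hit rate is taken as the fraction of queries whose retrieved list contains the expected dataset, and normalized retrieval entropy as the Shannon entropy of retrieved-dataset frequencies divided by the log of the catalog size. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def hit_rate(retrieved, expected):
    """Fraction of queries whose retrieved list contains the expected ID.

    Assumed definition; `retrieved` is a list of ID lists, one per query,
    and `expected` is the list of target IDs in the same order.
    """
    hits = sum(1 for ids, target in zip(retrieved, expected) if target in ids)
    return hits / len(expected)


def normalized_retrieval_entropy(retrieved, catalog_size):
    """Shannon entropy of retrieved-ID frequencies, scaled to [0, 1].

    Assumed definition: dividing by log(catalog_size) gives 1.0 when every
    dataset in the catalog is retrieved equally often, and 0.0 when the
    same dataset is returned for every query.
    """
    counts = Counter(i for ids in retrieved for i in ids)
    total = sum(counts.values())
    if total == 0 or catalog_size < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(catalog_size)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data: two queries, top-2 results each, over a catalog of four datasets.
retrieved = [["a", "b"], ["c", "d"]]
expected = ["a", "x"]

print(hit_rate(retrieved, expected))
print(normalized_retrieval_entropy(retrieved, catalog_size=4))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Under these assumed definitions, a retriever that keeps returning the same few datasets can still score a high hit rate while its entropy stays low, which is why the caption reports the two measures together.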