We Need Improved Data Curation and Attribution in AI for Scientific Discovery

Mara Graziani, Antonio Foncubierta, Dimitrios Christofidellis, Irina Espejo-Morales, Malina Molnar, Marvin Alberts, Matteo Manica, Jannis Born

TL;DR

The paper assesses the rising role of synthetic data in scientific discovery and the ensuing data integrity challenges, showing that most public datasets remain underutilized and that distinguishing real from synthetic content is increasingly difficult. It proposes automated data-curation and provenance practices, plus targeted watermarking of real data, as practical strategies to preserve model robustness while integrating synthetic data. Across molecular, textual, transcriptomic, and spectral modalities, detection of synthetic data is modality-dependent, underscoring the need for domain-specific provenance and attribution mechanisms. Platform analyses of Zenodo and HuggingFace quantify data uptake delays and synthetic-content growth to inform policy on tagging and traceability, advocating scalable, semi-automated workflows for robust data management. The work highlights actionable paths for data curation and attribution to mitigate misinformation and model collapse in AI-driven scientific discovery.

Abstract

As the interplay between human-generated and synthetic data evolves, new challenges arise in scientific discovery concerning the integrity of the data and the stability of the models. In this work, we examine the role of synthetic data as opposed to that of real experimental data for scientific research. Our analyses indicate that nearly three-quarters of experimental datasets available on open-access platforms have relatively low adoption rates, opening new opportunities to enhance their discoverability and usability by automated methods. Additionally, we observe an increasing difficulty in distinguishing synthetic from real experimental data. We propose supplementing ongoing efforts in automating synthetic data detection by increasing the focus on watermarking real experimental data, thereby strengthening data traceability and integrity. Our estimates suggest that watermarking even less than half of the real-world data generated annually could help sustain model robustness while promoting a balanced integration of synthetic and human-generated content.

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Analysis of dataset uploads, adoption rates, and the percentage of synthetic content in data-sharing platforms. A. Trends in dataset uploads, showing the number of bytes uploaded over the past decade (2014-2024). B. Dataset adoption rates in 2024, represented as the empirical cumulative distribution function (ECDF) of dataset downloads. This shows that $87$% of Zenodo datasets and $83$% of HF datasets have fewer than 100 downloads. C and D. Estimates of the proportion of synthetic data in C. HF datasets and D. Zenodo datasets, reported as the fraction of synthetic bytes compared to the total uploaded bytes.
  • Figure 2: Frequency of AI-favored terms in scientific paper abstracts. A. Percentage of abstracts containing at least one of six common AI-favored words: delve, underscore, intricacies, groundbreaking, scrutinize and meticulous. Results are categorized by field and publication status, based on an analysis of the metadata from $1.5$M preprints from arXiv, bioRxiv, chemRxiv and medRxiv, as well as $5.5$M published papers. B. Breakdown of the relative frequency of each word, compared to that observed in the year 2019.
  • Figure 3: Impact of inaccurate data on scientific literature and model training. A. Number of paper retractions categorized by publisher, with retractions from Hindawi excluded from the analysis. B. Propagation effect. 1-hop: papers citing at least one retracted paper; 2-hops: papers citing at least one paper that cites a retracted paper; and 3-hops: papers that continue this chain, citing a paper that cites a 2-hop paper. C. Observed performance degradation as increasing fractions of synthetic data are introduced during model training to predict molecular structure from IR spectra. Accuracy is measured by comparing the ground truth to the Top-1, -5 and -10 ranked predicted molecules. The quality of the synthetic data introduces a bias, which impacts the model's performance.
  • Figure 4: Estimated log-likelihood response to increasing fractions of watermarked data for human-generated content. Computed following the estimates of Kazdan et al. (2024) on the relationship between real content and model performance for models with 512 tokens of context length.
  • Figure 5: Record creation to publication time. Time in years passing from the creation date to the publication date of a record in Zenodo: (a) absolute time; (b) overall time, taking retrospective uploads into account.
  • ...and 3 more figures
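
The hop definitions in Figure 3B can be read as a breadth-first traversal of the citation graph, starting from the retracted papers. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `hop_levels`, the `cited_by` mapping layout, and the toy paper ids are all hypothetical. It also assumes each paper is counted once, at its shortest distance from any retracted paper, which the caption does not state explicitly.

```python
from collections import deque


def hop_levels(cited_by, retracted, max_hops=3):
    """Assign each paper its shortest citation distance to a retracted paper.

    cited_by: dict mapping a paper id to the ids of papers that cite it
              (hypothetical layout, chosen for illustration).
    retracted: iterable of retracted paper ids (hop 0, not reported).
    Returns {paper_id: hop} for hops 1..max_hops.
    """
    levels = {}
    seen = set(retracted)
    queue = deque((paper, 0) for paper in seen)
    while queue:
        paper, hop = queue.popleft()
        if hop == max_hops:
            continue  # do not expand beyond the last hop of interest
        for citer in cited_by.get(paper, ()):
            if citer not in seen:
                seen.add(citer)
                levels[citer] = hop + 1
                queue.append((citer, hop + 1))
    return levels


# Toy graph: R is retracted; A and B cite R; C cites A and B; and so on.
toy_cited_by = {
    "R": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "C": ["E"],
    "E": ["F"],
}
toy_levels = hop_levels(toy_cited_by, {"R"})
```

In the toy graph, `F` is four citations away from the retracted paper and is therefore left out. Counting how many papers fall at each hop gives the kind of per-hop totals that panel B describes.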