A Decision Tree to Shepherd Scientists through Data Retrievability

Andrea Bianchi, Giordano d'Aloisio, Francesca Marzi, Antinisca Di Marco

TL;DR

The paper addresses the data reproducibility gap by emphasizing data retrievability, the ability to obtain datasets in the same form as used in experiments, including preprocessing context. It introduces a Data Retrievability Tree to guide researchers through access scenarios, preprocessing provenance, and metadata provisions, forming a structured method to assess and improve dataset retrievability. The main contribution is the eight-node decision tree that yields actionable metadata guidance and can serve as the foundation for an automated metadata generation web service. This approach aims to enhance transparency and replicability of research by facilitating precise data access and reuse, with future work extending to usability documentation and automated tooling.

Abstract

Reproducibility is a crucial aspect of scientific research that involves the ability to independently replicate experimental results by analysing the same data or repeating the same experiment. Over the years, many works have been proposed to make experimental results reproducible. However, very few address the importance of data reproducibility, defined as the ability of independent researchers to obtain the same dataset used as input for experimentation. Properly addressing data reproducibility is crucial because providing a link to the data is often not enough to make the results reproducible: proper metadata (e.g., preprocessing instructions) must also be provided to make a dataset fully reproducible. In this work, we aim to fill this gap by proposing a decision tree to shepherd researchers through the reproducibility of their datasets. In particular, this decision tree guides researchers in identifying whether the dataset is actually reproducible and whether additional metadata (i.e., additional resources needed to reproduce the data) must also be provided. This decision tree will be the foundation of a future application that will automate the data reproduction process by providing the necessary metadata based on the particular context (e.g., data availability, data preprocessing, and so on). It is worth noting that, in this paper, we detail the steps to make a dataset retrievable, while we will detail other crucial aspects of reproducibility (e.g., dataset documentation) in future works.
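
To make the idea of a metadata-yielding decision tree concrete, the following is a minimal Python sketch of how such a tree could be represented and walked. It is not the paper's Data Retrievability Tree: the three questions, the keys (`publicly_available`, `preprocessed`, `available_on_request`), the metadata strings, and the `walk` helper are all hypothetical and chosen only for illustration, loosely following the contexts named in the abstract (data availability and data preprocessing).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Leaf:
    """Terminal node: whether the dataset is retrievable and which metadata to provide."""
    retrievable: bool
    metadata: List[str] = field(default_factory=list)


@dataclass
class Question:
    """Internal node: a yes/no question about the dataset."""
    key: str
    text: str
    yes: Union["Question", Leaf]
    no: Union["Question", Leaf]


# Hypothetical three-question tree (an assumption, NOT the paper's eight nodes).
TREE = Question(
    key="publicly_available",
    text="Is the dataset publicly available?",
    yes=Question(
        key="preprocessed",
        text="Was the dataset preprocessed before the experiment?",
        yes=Leaf(True, ["persistent link to raw data",
                        "preprocessing instructions or script"]),
        no=Leaf(True, ["persistent link to raw data"]),
    ),
    no=Question(
        key="available_on_request",
        text="Can the dataset be obtained on request?",
        yes=Leaf(True, ["access procedure and contact"]),
        no=Leaf(False),
    ),
)


def walk(node: Union[Question, Leaf], answers: Dict[str, bool]) -> Leaf:
    """Follow the yes/no answers from the root until a leaf is reached."""
    while isinstance(node, Question):
        node = node.yes if answers[node.key] else node.no
    return node


if __name__ == "__main__":
    outcome = walk(TREE, {"publicly_available": True, "preprocessed": True})
    print(outcome.retrievable, outcome.metadata)
```

Keeping the tree as plain data separates the questions from the traversal logic, so an automated service of the kind the abstract anticipates could, in principle, swap in the actual nodes without changing the walking code.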

Paper Structure (4 sections, 1 figure)

This paper contains 4 sections, 1 figure.

Figures (1)

  • Figure 1: Data Retrievability Tree