yProv4DV: Reproducible Data Visualization Scripts Out of the Box

Gabriele Padovani; Sandro Fiore

Abstract

Although visualization is a critical step in communicating new academic results, plots are frequently shared without the complete combination of code, input data, execution context and outputs required to reproduce the resulting figures independently. Existing reproducibility solutions tend to focus on computational pipelines or workflow management systems and do not cover the script-based visualization practices commonly used by researchers and practitioners. Additionally, the minimalist nature of current Python data visualization libraries tends to speed up the creation of images, which discourages users from spending time integrating additional tools into these short scripts. This paper proposes yProv4DV, a lightweight library designed to enable reproducible data visualization scripts through the use of provenance information while minimizing the need for code modifications. Through a single call, users can track inputs, outputs and source code files, so that their data visualization software can be saved and fully reproduced. As a result, the library fills a gap in reproducible research workflows by addressing the reproducibility of plots in scientific publications.

Paper Structure (7 sections, 4 figures, 1 table)

This paper contains 7 sections, 4 figures, 1 table.

Motivation and significance
Software architecture
Software functionalities
Sample code snippets analysis
Illustrative examples
Impact
Conclusions and Future Work

Figures (4)

Figure 1: A standard data visualization pipeline compared to the one proposed with yProv4DV. Currently, a set of Python scripts is created to produce the plots, often with a 1:1 ratio between images and scripts. With yProv4DV, a single Python script can be used, since every run that generates an output is tracked and can be edited and reused according to the user's needs.
Figure 2: RO-Crate metadata file in JSON format detailing the structure of a reproducibility package. It has been simplified for visualization purposes, showing input and output files, and the Python source code used.
Figure 3: A brief example of a provenance graph describing the visualization pipeline. On the left are the inputs, including the data used (results.csv), the requirements for the Python environment, and the modules and scripts used. On the right is the image output. Attributes of the main activity include the Git hash of the source code, the execution command used, and the start and end times.
Figure 4: Real-world use case example. Starting from a set of input files (reduced for visualization purposes), five charts were produced, with yProv4DV creating a reproducible snapshot of the process for each execution.
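
To make the kind of metadata described in the Figure 2 caption more concrete, the following is a minimal sketch of a simplified RO-Crate 1.1 style metadata document listing input, output and source code files. It is not the yProv4DV API and does not reproduce the file the library actually generates: the helper `build_crate_metadata` and the file names `plot.png` and `plot.py` are hypothetical, and only `results.csv` is taken from the Figure 3 caption.

```python
import json


def build_crate_metadata(inputs, outputs, sources):
    """Return a simplified RO-Crate 1.1 style metadata dictionary.

    Hypothetical helper for illustration only; it is not part of yProv4DV.
    Each argument is a list of relative file paths.
    """
    files = []
    for role, paths in (("input", inputs), ("output", outputs), ("source code", sources)):
        for path in paths:
            # One File entity per tracked path, labelled with its role.
            files.append({"@id": path, "@type": "File", "description": f"{role} file"})

    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            # Metadata descriptor pointing at the root dataset.
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            # Root dataset referencing every tracked file.
            {
                "@id": "./",
                "@type": "Dataset",
                "hasPart": [{"@id": f["@id"]} for f in files],
            },
            *files,
        ],
    }


metadata = build_crate_metadata(
    inputs=["results.csv"],   # data file named in the Figure 3 caption
    outputs=["plot.png"],     # hypothetical output image
    sources=["plot.py"],      # hypothetical visualization script
)
print(json.dumps(metadata, indent=2))
```

Keeping the file entities in a flat `@graph` and referencing them from the root dataset through `hasPart` follows the general RO-Crate layout. A real reproducibility package would likely carry further details, such as the execution context, that are omitted here.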
