dtreg: Describing Data Analysis in Machine-Readable Format in Python and R

Olga Lezhnina, Manuel Prinz, Markus Stocker

TL;DR

The paper proposes dtreg, a Python and R package for pre-publication, machine-readable reporting of data analyses. Results are organized according to data types registered in data type registries (DTRs) and converted to JSON-LD. The package integrates ePIC and ORKG schemata within the Loom framework to improve semantic interoperability and offline accessibility, reducing reliance on post-publication extraction. The authors describe the architecture and documentation practices, and demonstrate the end-to-end workflow with a t-test on Iris data; they also report strong test coverage and open-source licensing. The approach aims to advance the FAIR principles by providing structured, reusable, and interoperable representations of analytical findings that can be used across research workflows and knowledge graphs.

Abstract

For scientific knowledge to be findable, accessible, interoperable, and reusable, it needs to be machine-readable. Moving forward from post-publication extraction of knowledge, we adopted a pre-publication approach to write research findings in a machine-readable format at early stages of data analysis. For this purpose, we developed the package dtreg in Python and R. To describe data analysis in a machine-readable format, dtreg applies registered and persistently identified data types, also known as schemata, which cover the most widely used statistical tests and machine learning methods. The package supports (i) downloading a relevant schema as a mutable instance of a Python or R class, (ii) populating the instance object with metadata about data analysis, and (iii) converting the object into a lightweight Linked Data format. This paper outlines the background of our approach, explains the code architecture, and illustrates the functionality of dtreg with a machine-readable description of a t-test on Iris data. We suggest that the dtreg package can enhance the methodological repertoire of researchers aiming to adhere to the FAIR principles.
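
The three steps listed in the abstract can be sketched in plain Python. This is a minimal illustration of the pattern only and does not use the dtreg API: the schema dictionary, its identifier URL, and the function names `load_schema` and `to_jsonld_sketch` are hypothetical, and all values are placeholders rather than real results.

```python
import json

# Hypothetical stand-in for a registered schema. In practice a schema would be
# fetched from a data type registry via its persistent identifier.
HYPOTHETICAL_SCHEMA = {
    "name": "group_comparison",
    "identifier": "https://example.org/schema/group_comparison",
    "properties": ["label", "has_input", "has_output"],
}


def load_schema(schema: dict) -> type:
    """Step (i): build a Python class with one attribute per schema property."""

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(schema["properties"])
        if unknown:
            raise TypeError(f"unknown properties: {sorted(unknown)}")
        # Properties not supplied yet default to None and can be set later.
        for prop in schema["properties"]:
            setattr(self, prop, kwargs.get(prop))

    return type(schema["name"], (), {"__init__": __init__, "schema": schema})


def to_jsonld_sketch(instance) -> str:
    """Step (iii): serialize a populated instance as a JSON-LD-style document."""
    schema = instance.schema
    doc = {
        "@context": {"@vocab": schema["identifier"] + "#"},
        "@type": schema["name"],
    }
    for prop in schema["properties"]:
        value = getattr(instance, prop)
        if value is not None:  # skip properties that were never populated
            doc[prop] = value
    return json.dumps(doc, indent=2)


# Step (i): obtain the schema as a class.
GroupComparison = load_schema(HYPOTHETICAL_SCHEMA)

# Step (ii): populate the instance; it stays mutable, so fields can be added
# as the analysis progresses.
analysis = GroupComparison(label="t-test on Iris data", has_input="Iris data")
analysis.has_output = "placeholder for the test results"

# Step (iii): convert to a Linked Data serialization.
print(to_jsonld_sketch(analysis))
```

Building the class from the schema at load time means a misspelled property name fails immediately instead of silently producing an invalid description.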

Paper Structure

This paper contains 12 sections and 3 figures.

Figures (3)

  • Figure 1: Selecting a schema for a data analysis method. Schemata are shown in yellow boxes, analytical choices in white boxes.
  • Figure 2: The relationship between the user input and the dtreg functionality. The user selects the schema based on the data analysis method. The entirety of data analysis information (the method, the data, and the test results) is used to populate the instance.
  • Figure 3: Information flow in the dtreg package using the input-process-output (IPO) model.