FAIR Jupyter: a knowledge graph approach to semantic sharing and granular exploration of a computational notebook reproducibility dataset

Sheeba Samuel, Daniel Mietchen

TL;DR

The paper presents FAIR Jupyter, a knowledge-graph-based framework that semantically enriches a computational reproducibility dataset of Jupyter notebooks linked to biomedical publications. It describes a two-stage workflow: first, generating the reproducibility dataset from PubMed Central; second, converting it into a knowledge graph (KG) by expressing it in established ontologies (e.g., PROV-O, REPRODUCE-ME, P-Plan, PAV, FaBiO), applying YARRRML/RML mappings with Morph-KGC, and loading the resulting triples into Apache Jena Fuseki. The resulting KG contains roughly 190 million triples and is accessible through a public SPARQL endpoint. It enables fine-grained queries and profiling of content types, reproducibility outcomes, and cross-resource relationships, with example queries illustrating practical use cases in research and education. This semantic sharing enhances findability, interoperability, and reusability while supporting reproducibility workflows and potential federation with external datasets, providing a scalable, machine-actionable, FAIR-compliant resource for researchers, educators, and policy makers.

Abstract

The way in which data are shared can affect their utility and reusability. Here, we demonstrate how data that we had previously shared in bulk can be mobilized further through a knowledge graph that allows for much more granular exploration and interrogation. The original dataset is about the computational reproducibility of GitHub-hosted Jupyter notebooks associated with biomedical publications. It contains rich metadata about the publications, associated GitHub repositories and Jupyter notebooks, and the notebooks' reproducibility. We took this dataset, converted it into semantic triples and loaded these into a triple store to create a knowledge graph, FAIR Jupyter, that we made accessible via a web service. This enables granular data exploration and analysis through queries that can be tailored to specific use cases. Such queries may provide details about any of the variables from the original dataset, highlight relationships between them or combine some of the graph's content with materials from corresponding external resources. We provide a collection of example queries addressing a range of use cases in research and education. We also outline how sets of such queries can be used to profile specific content types, either individually or by class. We conclude by discussing how such a semantically enhanced sharing of complex datasets can both enhance their FAIRness, i.e., their findability, accessibility, interoperability, and reusability, and help identify and communicate best practices, particularly with regard to data quality, standardization, automation and reproducibility.
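
As a rough illustration of what querying such a web service could look like, the following minimal Python sketch builds a SPARQL 1.1 Protocol GET request and flattens a JSON result set. It is not taken from the paper: the endpoint URL is a placeholder (this text does not state the actual address), the query is deliberately generic and uses no FAIR Jupyter vocabulary, and the helper names `build_sparql_request` and `parse_bindings` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder only: the actual FAIR Jupyter endpoint URL is not given here.
ENDPOINT = "https://example.org/sparql"

# Generic query that fetches a handful of arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(results: dict) -> list[dict]:
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> list[dict]:
    """Send the query and return one dict per result row (needs network access)."""
    with urlopen(build_sparql_request(endpoint, query)) as response:
        return parse_bindings(json.load(response))
```

Calling `run_query(ENDPOINT, QUERY)` against a real endpoint would return one dictionary per result row. More specific queries would replace the generic triple pattern with classes and properties from the graph's data model.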

Paper Structure (12 sections, 2 figures, 4 tables)

This paper contains 12 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Workflow overview. The blue workflow was used to construct the original dataset [samuel2023Dataset] and is described in [samuel2024computational], whereas the subsequent knowledge graph construction workflow shown in green represents the current study.
  • Figure 2: Partial outline of the data model used in FAIR Jupyter. Classes of entities (represented by ellipses) and the class properties (represented by orange rectangles) were inferred from the original dataset and, along with the relationships between them (arrows), expressed in terms of relevant ontologies. Note that requirement files and repository files are both represented as repr:File.
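
The note in the Figure 2 caption about requirement files and repository files sharing one class can be sketched as follows. This is an illustrative assumption, not the authors' mapping: the base IRI, the namespace bound to `repr:` and the function `file_triples` are all hypothetical placeholders, and only the `rdf:type` IRI is a standard term.

```python
# Standard RDF term for class membership.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Placeholder namespaces: the real IRIs behind "repr:" and the graph's
# resource identifiers are not stated in this text.
REPR = "http://example.org/repr#"
BASE = "http://example.org/fairjupyter/"


def file_triples(kind: str, file_id: int) -> list[str]:
    """Return N-Triples lines typing a file record as repr:File.

    'kind' distinguishes the source table (e.g. requirement vs. repository
    file) in the subject IRI only; the assigned class is the same for both.
    """
    subject = f"{BASE}{kind}/{file_id}"
    return [f"<{subject}> <{RDF_TYPE}> <{REPR}File> ."]


# Both kinds of record end up as instances of the same class.
requirement = file_triples("requirement_file", 1)
repository = file_triples("repository_file", 1)
```

Under this sketch, a query for all instances of the shared class would return both kinds of file, so any distinction between them would have to come from other properties or from the subject IRI pattern.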