Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models

Zhiwei Li, Carl Kesselman, Mike D'Arcy, Michael Pazzani, Benjamin Yixing Xu

TL;DR

This paper addresses reproducibility and correctness gaps in ML for eScience by proposing Continuous FAIRness, a data-centric, socio-technical framework built on the Deriva platform and the Deriva-ML library. It introduces a two-part metadata catalog (Domain schema and ML schema) and robust data packaging (BagIt/BDBag) with Minid identifiers to ensure findability, accessibility, interoperability, and reusability of all data and artifacts. The architecture supports secure, policy-driven collaboration and provenance-aware ML workflows across computing environments, demonstrated through EyeAI (glaucoma detection) and MusMorph genotype prediction use cases, with formal evaluation using FAIR Metrics. The work aims to improve reproducibility and collaboration in eScience by making data the central artifact and enabling end-to-end traceability from data modeling to model deployment, while outlining future enhancements in versioning and workflow configuration.

Abstract

Increasingly, artificial intelligence (AI) and machine learning (ML) are used in eScience applications [9]. While these approaches have great potential, the literature has shown that ML-based approaches frequently suffer from results that are either incorrect or unreproducible due to mismanagement or misuse of data used for training and validating the models [12, 15]. Recognition of the necessity of high-quality data for correct ML results has led to data-centric ML approaches that shift the central focus from model development to creation of high-quality data sets to train and validate the models [14, 20]. However, there are limited tools and methods available for data-centric approaches to explore and evaluate ML solutions for eScience problems, which often require collaborative multidisciplinary teams working with models and data that rapidly evolve as an investigation unfolds [1]. In this paper, we show how data management tools based on the principle that all of the data for ML should be findable, accessible, interoperable, and reusable (i.e., FAIR [26]) can significantly improve the quality of data that is used for ML applications. When combined with best practices that apply these tools to the entire life cycle of an ML-based eScience investigation, these tools can significantly improve the ability of an eScience team to create correct and reproducible ML solutions. We propose an architecture and implementation of such tools and demonstrate through two use cases how they can be used to improve ML-based eScience investigations.

Paper Structure (29 sections, 6 figures, 1 table)


Figures (6)

  • Figure 1: The Deriva-ML Architecture has three main components. The Deriva platform for scientific asset management is responsible for data discovery via a metadata catalog service (ERMRest), a versioned object store for storing all data assets (Hatrac), and a model-driven adaptive user interface (Chaise). Big Data Bags (BDBag) and a Python interface library, Deriva-ML, support data migration, and map lower-level Deriva operations into higher-level interfaces that more closely align with the ML development process. We also support various on-demand computing environments used for shared ML development. The key roles are the research software engineers who build the system, domain experts who curate the data through the Deriva platform, and machine learning engineers who develop ML models in computing environments.
  • Figure 2: The metadata catalog is structured into two interconnected components. The upper component is tailored to a specific domain; the upper part of the figure shows the instantiation used for the use case described in the background section. The lower part of the figure shows the data model for the reusable part of the catalog, which models the overall ML data objects produced by the ML development process. The model consists of descriptions of the data elements (in blue), data assets (in yellow), and controlled vocabulary (in green).
  • Figure 3: The main components and methods of the Deriva-ML library. The left section is the base class, Deriva-ML, which contains three essential methods that support ML process tracking and are general across applications. The right section shows the derived class Domain-ML. It contains three types of methods customized to the application and domain data model for efficient data management. The components W1 through W6 facilitate the entire ML workflow, with W4 and W6 optional.
  • Figure 4: End-to-end data-centric best practice involving seven steps. The workflow is cyclical, with the last step (Evolution of controlled vocabulary and data model) looping back to the first step (Data Modeling), reflecting the iterative process. The essential roles involved, shown at the top of each box, are Domain Experts, Research Engineers, and ML Engineers.
  • Figure 5: The detailed page of an Execution in the Chaise UI. It contains the workflow, datasets, execution assets, and execution metadata associated with the execution to ensure reproducibility.
  • ...and 1 more figure
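
The Figure 1 caption names Big Data Bags (BDBag), which build on the BagIt packaging format, as the mechanism for moving data between the catalog and computing environments. As a rough illustration of what such a package contains, the sketch below writes a minimal BagIt-style bag using only the Python standard library: a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing one checksum per payload file. It does not use the BDBag or Deriva-ML libraries, and the function names (`make_minimal_bag`, `verify_bag`) and the sample file are hypothetical, chosen only for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_minimal_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus a bag declaration and a manifest.

    Hypothetical helper: `payload` maps flat file names to their contents.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name in sorted(payload):
        target = data_dir / name
        target.write_bytes(payload[name])
        # Each manifest line is "<checksum>  <path relative to the bag root>".
        manifest_lines.append(f"{sha256_of(target)}  data/{name}")

    # The bag declaration identifies the BagIt version and tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute every payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / relative_path) != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Illustrative payload only; not data from the paper.
        bag = make_minimal_bag(Path(tmp) / "example_bag", {"labels.csv": b"id,label\n1,a\n"})
        print("bag is valid:", verify_bag(bag))
```

Because the manifest fixes a checksum for every payload file, a consumer can detect whether a dataset changed after packaging, which is one ingredient of the reproducibility the paper aims for. Assigning a persistent identifier such as a Minid to the bag is a separate step that this sketch does not cover.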