Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences

Genoveva Vargas-Solar, Jérôme Darmont, Alejandro Adorjan, Javier A. Espinosa-Oviedo, Carmem Hara, Sabine Loudcher, Regina Motz, Martin Musicante, José-Luis Zechinelli-Martini

TL;DR

This paper argues that Life & Earth sciences face data-management challenges due to heterogeneous, high-volume data and the need for reproducibility. It proposes a data lake architecture augmented with formal and semi-automatic curation tools and a research-in-the-loop framework to involve scientists in data selection and governance. The work outlines a pivot meta-representation and a global knowledge graph to integrate harvested data, models, and knowledge, and discusses dataverses as domain-specific, citable publication spaces. Pilot experiments in seismology and biodiversity in Brazil are planned to test the approach, with the aim of enhancing open science, data sharing, and cross-disciplinary discovery. Overall, the vision targets a scalable, governed, and shareable data ecosystem that supports data-driven experimentation in Life & Earth sciences.

Abstract

This vision paper introduces a data lake architecture designed to meet the growing data management needs of Life & Earth sciences. As the data landscape evolves, navigating it to make the most of scientific opportunities becomes increasingly important. We outline a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to scientific discovery. The core of the data lake's design and construction is the development of formal and semi-automatic tools that enable the meticulous curation of quantitative and qualitative data from experiments. Our "research-in-the-loop" methodology ensures that scientists across disciplines are integrally involved in the curation process, combining automated, mathematical, and manual tasks to address complex problems, from seismic detection to biodiversity studies. By fostering the reproducibility and applicability of research, our approach enhances the integrity and impact of scientific experiments. This initiative is set to improve data management practices, strengthening the capacity of Life & Earth sciences to address some of the most critical environmental and biological challenges of our time.

Paper Structure (22 sections, 1 figure)