DatAasee -- A Metadata-Lake as Metadata Catalog for a Virtual Data-Lake

Christian Himpe

TL;DR

This paper addresses the challenge of managing metadata for distributed research data sources in university libraries by introducing the metadata-lake concept. It derives an architecture for aggregating and querying metadata from the data-lake, specifies a formal model with intra- and inter-object metadata and a metadata graph, and presents DatAasee as an open-source proof of concept (PoC). DatAasee uses a three-tier software architecture with a graph-oriented NoSQL store (ArcadeDB) and a RESTful JSON API, and supports multiple metadata formats and ingestion protocols (e.g., OAI-PMH, DataCite, DublinCore). A preliminary evaluation discusses feature coverage against existing metadata-layer criteria and the FAIR principles, highlighting DatAasee's potential as a centralized metadata layer for virtual data-lakes, dataspaces, and FAIR repositories.

Abstract

Metadata management for distributed data sources is a long-standing and ever-growing problem. To address this challenge in a research-data and library-oriented setting, this work constructs a data architecture derived from the data-lake: the metadata-lake. A proof-of-concept implementation of the proposed metadata aggregator is also presented and evaluated.

Paper Structure

This paper contains 28 sections, 3 equations, and 4 figures.

Figures (4)

  • Figure 1: Abstract metadata-lake.
  • Figure 2: DatAasee metadata-lake.
  • Figure 3: DatAasee outward architecture.
  • Figure 4: DatAasee inward architecture.