DatAasee -- A Metadata-Lake as Metadata Catalog for a Virtual Data-Lake
Christian Himpe
TL;DR
This paper addresses the challenge of managing metadata for distributed research data sources in university libraries by introducing the metadata-lake concept. It formalizes a data-lake-inspired architecture for aggregating and querying metadata, specifies a formal model with intra- and inter-object metadata and a metadata graph, and presents DatAasee as an open-source proof of concept. DatAasee uses a three-tier software architecture with a graph-oriented NoSQL store (ArcadeDB) and a RESTful JSON API, and supports multiple metadata formats and ingestion protocols (e.g., OAI-PMH, DataCite, DublinCore). A preliminary evaluation discusses feature coverage against existing metadata-layer criteria and the FAIR principles, highlighting DatAasee's potential as a centralized metadata layer for virtual data-lakes, dataspaces, and FAIR repositories.
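To make the metadata-graph idea in the summary above more concrete, the following is a minimal illustrative sketch, not DatAasee's actual schema or API. It assumes that intra-object metadata means the descriptive fields of a single record and that inter-object metadata means labeled relations between records; the class names, field names, and the relation label are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """One graph node: intra-object metadata describing a single data object."""
    record_id: str
    title: str
    source: str  # hypothetical label for the originating repository


@dataclass
class MetadataGraph:
    """Records are nodes; labeled edges carry inter-object metadata."""
    records: dict[str, MetadataRecord] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_record(self, record: MetadataRecord) -> None:
        # Ingest one record, keyed by its identifier.
        self.records[record.record_id] = record

    def relate(self, src: str, relation: str, dst: str) -> None:
        # Only link records that have already been ingested.
        if src not in self.records or dst not in self.records:
            raise KeyError("both records must exist before relating them")
        self.edges.append((src, relation, dst))

    def related(self, record_id: str) -> list[tuple[str, MetadataRecord]]:
        # Follow outgoing edges from one record.
        return [
            (relation, self.records[dst])
            for src, relation, dst in self.edges
            if src == record_id
        ]


# Two records from different (hypothetical) sources, linked by one relation.
graph = MetadataGraph()
graph.add_record(MetadataRecord("a", "Survey dataset", "repo-1"))
graph.add_record(MetadataRecord("b", "Survey dataset, version 2", "repo-2"))
graph.relate("b", "isNewVersionOf", "a")
```

In this sketch, a query such as graph.related("b") walks the inter-object edges of one record, which is the kind of lookup a graph-oriented store is suited to; the in-memory dictionaries stand in for that store only for illustration.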
Abstract
Metadata management for distributed data sources is a long-standing and ever-growing problem. To address this challenge in a research-data and library-oriented setting, this work constructs a data architecture derived from the data-lake: the metadata-lake. A proof-of-concept implementation of the proposed metadata aggregator is also presented and evaluated.
