A metadata model for profiling multidimensional sources in data ecosystems
Claudia Diamantini, Alessandro Mele, Domenico Potena, Cristina Rossetti, Emanuele Storti
TL;DR
The paper addresses the challenge of metadata management for aggregated, multidimensional data in data ecosystems by introducing an RDF-based metadata model that extends Semantic Data Lakes with a Knowledge Graph alignment using KPIOnto. It defines data-source and attribute-level constructs (including dimensional and measures/descriptive attributes) and provides profile structures to summarize value distributions, enabling improved data discovery, quality assessment, and query support. The authors outline use cases and present experiments showing close-to-linear scalability of profile generation across increasing data cardinalities, with nuanced effects from noise and text preprocessing. This work advances interoperable, semantically enriched metadata for multidimensional data and lays groundwork for dynamic, scalable governance and integration in Data Lakes and Data Spaces.
Abstract
The Big Data landscape poses challenges in managing diverse data formats, requiring efficient storage and processing for high-quality analysis. Effective metadata management is crucial for organizing, accessing, and reusing data within these data ecosystems. Existing metadata vocabularies and standards, however, do not adequately accommodate aggregated or summary data. This paper introduces a metadata model to support semantic annotation and profiling of multidimensional data. Defined as an RDF vocabulary, the model provides a flexible and extensible graph representation for metadata at source and attribute levels, aligning dimensions and measures to a reference Knowledge Graph and summarizing value distributions in profiles. An evaluation of the execution time for profile generation is also reported, across data sources with different cardinalities.
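To make the idea of a profile that summarizes value distributions concrete, the following is a minimal sketch under stated assumptions. The structure of the profile, the function name `build_profile`, and the field names are hypothetical and are not taken from the model defined in the paper; the sketch only illustrates how the values of one attribute could be condensed into relative frequencies.

```python
from collections import Counter


def build_profile(values):
    """Summarize the values of one attribute as a frequency profile.

    Hypothetical structure: the field names below are illustrative
    assumptions, not the vocabulary terms of the paper's model.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {
        "cardinality": total,        # number of values observed
        "distinct": len(counts),     # number of distinct values
        "frequencies": {v: c / total for v, c in counts.items()},
    }


# Hypothetical dimensional attribute "country" of a multidimensional source.
country_profile = build_profile(["IT", "IT", "FR", "DE"])
```

A single pass over the values is enough to build such a summary, which is consistent with the close-to-linear growth of profile generation time reported in the evaluation.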
