A metadata model for profiling multidimensional sources in data ecosystems

Claudia Diamantini, Alessandro Mele, Domenico Potena, Cristina Rossetti, Emanuele Storti

TL;DR

The paper addresses metadata management for aggregated, multidimensional data in data ecosystems by introducing an RDF-based metadata model that extends Semantic Data Lakes with an alignment to a Knowledge Graph based on KPIOnto. It defines constructs at the data-source and attribute levels (covering dimensional attributes as well as measures and descriptive attributes) and provides profile structures that summarize value distributions, supporting data discovery, quality assessment, and querying. The authors outline use cases and present experiments showing close-to-linear scalability of profile generation as data cardinality increases, with nuanced effects from noise and text preprocessing. The work advances interoperable, semantically enriched metadata for multidimensional data and lays groundwork for dynamic, scalable governance and integration in Data Lakes and Data Spaces.

Abstract

The Big Data landscape poses challenges in managing diverse data formats, requiring efficient storage and processing for high-quality analysis. Effective metadata management is crucial for organizing, accessing, and reusing data within these data ecosystems. Existing metadata vocabularies and standards, however, do not adequately accommodate aggregated or summary data. This paper introduces a metadata model to support semantic annotation and profiling of multidimensional data. Defined as an RDF vocabulary, the model provides a flexible and extensible graph representation for metadata at source and attribute levels, aligning dimensions and measures to a reference Knowledge Graph and summarizing value distributions in profiles. An evaluation of the execution time for profile generation is also presented, across data sources with different cardinalities.
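
As a rough illustration of what "summarizing value distributions in profiles" could look like, the sketch below computes a simple profile for one dimensional attribute and one measure. It is not the paper's vocabulary or algorithm: the function names `profile_dimensional` and `profile_measure`, the dictionary keys, and the sample rows are all hypothetical, and the actual model represents such profiles as RDF rather than Python dictionaries.

```python
from collections import Counter
from statistics import mean

def profile_dimensional(values):
    """Hypothetical profile of a dimensional attribute: frequency of each distinct value."""
    counts = Counter(values)
    return {"distinct": len(counts), "frequencies": dict(counts)}

def profile_measure(values):
    """Hypothetical profile of a numeric measure: basic summary statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

# Made-up sample rows with one dimension (city) and one measure (sales).
rows = [
    {"city": "Ancona", "sales": 10},
    {"city": "Rome", "sales": 25},
    {"city": "Ancona", "sales": 15},
]

city_profile = profile_dimensional([r["city"] for r in rows])
sales_profile = profile_measure([r["sales"] for r in rows])

print(city_profile)
print(sales_profile)
```

Both functions make a single pass over the attribute values, which is in line with the close-to-linear scaling reported for profile generation, although the paper's own implementation may differ.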

Paper Structure

This paper contains 11 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) Conceptual representation of a data source according to the Dimensional Fact Model and (b) the corresponding data source metadata. Dashed lines link to attributes metadata.
  • Figure 2: Example of described metadata for the dimensional attribute city with an excerpt from the Knowledge Graph involving the attribute.
  • Figure 3: Example of metadata for an integer attribute representing a measure.
  • Figure 4: Average running time and standard deviation for profile calculation of dimensional attributes.
  • Figure 5: Average running time and standard deviation required for profile calculation of measures or descriptive attributes.