High-Level ETL for Semantic Data Warehouses -- Full Version
Rudra Pratap Deb Nath, Oscar Romero, Torben Bach Pedersen, Katja Hose
TL;DR
This paper tackles the challenge of enabling OLAP-like analysis over semantic data by introducing a two-layer RDF-based ETL framework that preserves semantic content during integration. It defines a Definition Layer for metadata and mappings and an Execution Layer for automated, high-level ETL operations, linked via a SourceToTargetMapping vocabulary. The authors implement a prototype, SETLCONSTRUCT, and its automatic variant, SETLAUTO, and evaluate them by integrating a Danish Business dataset with an EU Subsidy dataset into a multidimensional (MD) semantic data warehouse (SDW) annotated with QB4OLAP. Compared with the earlier programmable framework SETLPROG, SETLCONSTRUCT requires 92% fewer typed characters and roughly half the development time, with similar performance. The approach relies on MD semantics, provenance tracking, and automated flow generation to reduce manual coding while maintaining semantic fidelity. Overall, the work offers a scalable, schema-driven way to build SDWs from heterogeneous semantic sources.
Abstract
The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth places new requirements on Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. Traditional Extract-Transform-Load (ETL) tools do not support the incorporation of semantic data into a Data Warehouse (DW) because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level, which relieves ETL developers of the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we use it to create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset, and we compare it with the previous programmable framework SETLPROG in terms of productivity, development time, and performance. The evaluation shows that 1) SETLCONSTRUCT reduces the Number of Typed Characters (NOTC) by 92% compared to SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT that generates the ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) with SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% with SETLAUTO; 3) SETLCONSTRUCT is scalable and performs similarly to SETLPROG.
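To make the idea of schema-level mappings more concrete, the following is a minimal sketch under stated assumptions, not the SETLCONSTRUCT implementation. It represents RDF-like triples as plain Python tuples. The prefixes `src:` and `tgt:`, the property names, and the function `apply_mappings` are all hypothetical and chosen only for illustration. The point it shows is that a mapping declared once per property at the schema level can be applied to every instance triple, so no per-operation manual mapping is needed.

```python
from typing import Dict, List, Tuple

# An RDF-like triple: (subject, predicate, object).
Triple = Tuple[str, str, str]

# Hypothetical schema-level mappings: source property -> target property.
# These names are illustrative only and do not come from the paper.
PROPERTY_MAPPINGS: Dict[str, str] = {
    "src:companyName": "tgt:name",
    "src:amountEuro": "tgt:subsidyAmount",
}


def apply_mappings(triples: List[Triple], mappings: Dict[str, str]) -> List[Triple]:
    """Rewrite each predicate according to the mappings.

    Triples whose predicate has no mapping are dropped.
    """
    result: List[Triple] = []
    for subject, predicate, obj in triples:
        target = mappings.get(predicate)
        if target is not None:
            result.append((subject, target, obj))
    return result


# Hypothetical source instance data.
source_triples: List[Triple] = [
    ("src:company1", "src:companyName", "Example ApS"),
    ("src:company1", "src:amountEuro", "1000"),
    ("src:company1", "src:internalNote", "not mapped"),
]

target_triples = apply_mappings(source_triples, PROPERTY_MAPPINGS)
for triple in target_triples:
    print(triple)
```

In this sketch, adding a new source property to the target only requires one new entry in `PROPERTY_MAPPINGS`; the transformation code itself stays unchanged.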
