Algebraic Data Integration

Patrick Schultz, Ryan Wisnesky

TL;DR

This paper develops an algebraic framework for data integration by treating schemas as equational theories and instances as their algebras, with data migrations captured by the adjoint functors $\Sigma_F$, $\Delta_F$, and $\Pi_F$, and pushouts used to form integrated schemas and instances. It introduces the Uber-flower query language, which provides a concise, computable syntax for expressing complex data migrations, and shows how to convert between Uber-flowers and traditional data migrations. The CQL tool embodies these ideas, offering decision procedures for equality in theories, saturation into term models, conservativity checks, and efficient (co-)evaluation of Uber-flower queries. A pushout-based design pattern is developed to guide algebraic data integration, supplemented by practical examples (e.g., medical records) and entity-resolution techniques. The work advances a computable, category-theoretic approach to data integration that bridges functional programming, category theory, and database theory.
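To make the role of pushouts concrete, here is a minimal sketch of a pushout at the level of finite sets only, for example the entity names of two schemas that share a common overlap. It is an illustration under simplifying assumptions, not the paper's construction: the paper takes pushouts of schemas and instances (equational theories), whereas this sketch ignores foreign keys, attributes, and equations. The function name `pushout` and the example entity names are hypothetical.

```python
def pushout(a, b, c, f, g):
    """Pushout of finite sets A <-f- C -g-> B.

    Computed as the disjoint union of A and B, quotiented by the
    smallest equivalence relation that identifies f(z) with g(z)
    for every z in C. Elements are tagged "A" or "B" to keep the
    union disjoint. Returns the set of equivalence classes.
    """
    # Union-find forest over the tagged disjoint union.
    parent = {("A", x): ("A", x) for x in a}
    parent.update({("B", y): ("B", y) for y in b})

    def find(x):
        # Walk to the class representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Glue the two images of every shared element together.
    for z in c:
        root_a, root_b = find(("A", f[z])), find(("B", g[z]))
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the equivalence classes.
    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return {frozenset(members) for members in classes.values()}


# Hypothetical example: two source schemas whose entities overlap on "P".
entities_a = {"Patient", "Visit"}
entities_b = {"Person", "Claim"}
overlap = {"P"}
f = {"P": "Patient"}  # overlap -> first schema
g = {"P": "Person"}   # overlap -> second schema

integrated = pushout(entities_a, entities_b, overlap, f, g)
```

In this example `Patient` and `Person` are merged into a single class, while `Visit` and `Claim` remain separate, so the integrated set has three elements.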

Abstract

In this paper we develop an algebraic approach to data integration by combining techniques from functional programming, category theory, and database theory. In our formalism, database schemas and instances are algebraic (multi-sorted equational) theories of a certain form. Schemas denote categories, and instances denote their initial (term) algebras. The instances on a schema $S$ form a category, $S$-Inst, and a morphism of schemas $F : S \to T$ induces three adjoint data migration functors: $\Sigma_F : S$-Inst $\to T$-Inst, defined by substitution along $F$, which has a right adjoint $\Delta_F : T$-Inst $\to S$-Inst, which in turn has a right adjoint $\Pi_F : S$-Inst $\to T$-Inst. We present a query language based on for/where/return syntax in which each query denotes a sequence of data migration functors, as well as a pushout-based design pattern for performing data integration using our formalism, and we describe the implementation of our formalism in a tool we call CQL.
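As an illustration of one of the three functors, the sketch below implements $\Delta_F$ on finite instances in the simplest reading, where an instance assigns a set of rows to each entity and a function to each foreign key, and $\Delta_F$ pulls a $T$-instance back along $F$ by composition. This is a simplified model under stated assumptions: attributes and equations are omitted, and the `Instance` class, the `delta` function, and the example schemas are all hypothetical names, not CQL's API.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    rows: dict  # entity name -> set of row ids
    fks: dict   # foreign-key name -> {source row id: target row id}


def follow_path(instance, path, row):
    """Apply a sequence of foreign keys to a row, left to right."""
    for fk in path:
        row = instance.fks[fk][row]
    return row


def delta(entity_map, fk_map, fk_source, t_instance):
    """Pull a T-instance back along a schema mapping F : S -> T.

    entity_map: S entity -> T entity
    fk_map:     S foreign key -> path (list of T foreign keys)
    fk_source:  S foreign key -> its source entity in S
    """
    # Each S entity receives the rows of its image entity in T.
    rows = {s: set(t_instance.rows[t]) for s, t in entity_map.items()}
    # Each S foreign key is interpreted by following its image path in T.
    fks = {
        fk: {r: follow_path(t_instance, path, r) for r in rows[fk_source[fk]]}
        for fk, path in fk_map.items()
    }
    return Instance(rows, fks)


# Hypothetical target schema T: Employee -worksIn-> Department -ownedBy-> Company
j = Instance(
    rows={"Employee": {"e1", "e2"}, "Department": {"d1", "d2"}, "Company": {"c1"}},
    fks={
        "worksIn": {"e1": "d1", "e2": "d2"},
        "ownedBy": {"d1": "c1", "d2": "c1"},
    },
)

# Hypothetical source schema S: Person -employer-> Org,
# with F sending employer to the path worksIn followed by ownedBy.
pulled_back = delta(
    entity_map={"Person": "Employee", "Org": "Company"},
    fk_map={"employer": ["worksIn", "ownedBy"]},
    fk_source={"employer": "Person"},
    t_instance=j,
)
```

Here `pulled_back` has the employees as its `Person` rows and maps each of them directly to company `c1` through `employer`, which shows why $\Delta_F$ is often described as a projection-like operation. $\Sigma_F$ and $\Pi_F$ require more machinery and are not sketched here.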

Paper Structure

This paper contains 48 sections, 92 equations, and 28 figures.

Figures (28)

  • Figure 1: A Schema and Instance in the Original Functorial Data Model
  • Figure 2: Example Functorial Data Migrations
  • Figure 3: The Attribute Problem
  • Figure 4: The Multi-sorted Equational Theory $Type$
  • Figure 5: Inference Rules for Multi-sorted Equational Logic
  • ...and 23 more figures