Repairing Databases over Metric Spaces with Coincidence Constraints

Youri Kaminsky, Benny Kimelfeld, Ester Livshits, Felix Naumann, David Wajc

TL;DR

This work studies the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space, and designs a (high probability) logarithmic-ratio approximation for general metrics.

Abstract

Datasets often contain values that naturally reside in a metric space: numbers, strings, geographical locations, machine-learned embeddings in a Euclidean space, and so on. We study the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space. The goal is to update the database values to restore consistency while minimizing the total distance between the original values and the repaired ones. We consider what we refer to as coincidence constraints, which include key constraints, inclusion constraints, foreign keys, and generally any restriction on the relationship between the numbers of cells of different labels (attributes) coinciding in a single value, for a fixed attribute set. We begin by showing that the problem is APX-hard for general metric spaces. We then present an algorithm solving the problem optimally for tree metrics, which generalize both the line metric (i.e., where repaired values are numbers) and the discrete metric (i.e., where we simply count the number of changed values). Combining our algorithm for tree metrics and a classic result on probabilistic tree embeddings, we design a (high probability) logarithmic-ratio approximation for general metrics. We also study the variant of the problem where each individual value's allowed change is limited. In this variant, it is already NP-complete to decide the existence of any legal repair for a general metric, and we present a polynomial-time repairing algorithm for the case of a line metric.
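
To make the objective concrete, the following minimal sketch computes the cost of a candidate repair as the total distance between original and repaired values, under the line metric and under the discrete metric. The function names and the sample values are illustrative assumptions, not taken from the paper, and the sketch only evaluates a given repair; it does not check constraints or search for an optimal one.

```python
from typing import Callable, Sequence


def line_metric(a: float, b: float) -> float:
    """Line metric: absolute difference between two numbers."""
    return abs(a - b)


def discrete_metric(a: float, b: float) -> float:
    """Discrete metric: 1 if the value changed, 0 otherwise."""
    return 0.0 if a == b else 1.0


def repair_cost(
    original: Sequence[float],
    repaired: Sequence[float],
    metric: Callable[[float, float], float],
) -> float:
    """Total distance between each original cell and its repaired value."""
    if len(original) != len(repaired):
        raise ValueError("a repair must keep the same cells as the original")
    return sum(metric(o, r) for o, r in zip(original, repaired))


# Hypothetical cell values: two candidate repairs that each change one cell.
original = [1.0, 1.0, 4.0]
near_repair = [1.0, 2.0, 4.0]   # moves the second cell by 1
far_repair = [1.0, 10.0, 4.0]   # moves the second cell by 9

print(repair_cost(original, near_repair, line_metric))      # 1.0
print(repair_cost(original, far_repair, line_metric))       # 9.0
print(repair_cost(original, near_repair, discrete_metric))  # 1.0
print(repair_cost(original, far_repair, discrete_metric))   # 1.0
```

Under the discrete metric both candidates cost the same because only the number of changed cells counts, whereas the line metric also accounts for how far each value moves.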

Paper Structure (20 sections, 14 theorems, 9 equations, 5 figures)

This paper contains 20 sections, 14 theorems, 9 equations, 5 figures.

Key Result

Theorem 6

An optimal repair can be found in polynomial time (if one exists), given a tree metric space $(M,\delta_T)$, a coincidence constraint $\Gamma$ over $M$, and an inconsistent database $D$.

Figures (5)

  • Figure 1: Example relations; same attribute names express intended foreign keys.
  • Figure 2: Optimal repairs of a database $D_{\textsf{pid}}$ (middle) according to two metric spaces $(M,\delta)$ (left and right) over the person identifiers (pid).
  • Figure 3: Database $D_{\textsf{nurse}}$ (on the left) and an optimal repair $E$ according to the Hamming distance and the discrete distance (on the right).
  • Figure 4: The line metric $(M,\delta_{\mathbb{R}})$ (left) and discrete metric $(M,\delta_{\neq})$ (right) cast as tree metrics, with $M=\{v_1,\dots,v_n\}$.
  • Figure 5: A database $D$ over the line metric. Each shape corresponds to a cell of one of three labels (circle, triangle, square). By Lemma \ref{lemma:line-suffix-contracted}, an optimal repair comprises two parts: an optimal repair for a strict prefix, and a contracted suffix.
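
As a rough illustration of how the line and discrete metrics can be viewed as tree metrics (Figure 4), the sketch below uses one standard construction: a path whose edge weights are the gaps between consecutive sorted values, and a star whose leaves hang off an auxiliary center by edges of weight 1/2. This construction, the helper names, and the sample points are assumptions made for illustration; the figure in the paper may use a different encoding.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Tree = Dict[Hashable, List[Tuple[Hashable, float]]]


def add_edge(tree: Tree, u: Hashable, v: Hashable, weight: float) -> None:
    """Add an undirected weighted edge to an adjacency-list tree."""
    tree.setdefault(u, []).append((v, weight))
    tree.setdefault(v, []).append((u, weight))


def tree_distance(tree: Tree, source: Hashable, target: Hashable) -> float:
    """Length of the unique source-target path, found by depth-first search."""
    stack = [(source, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for neighbor, weight in tree.get(node, []):
            if neighbor != parent:  # a tree has no cycles, so skip only the parent
                stack.append((neighbor, node, dist + weight))
    raise ValueError("target is not in the same tree as source")


def path_tree(points: Sequence[float]) -> Tree:
    """Line metric as a path: consecutive sorted points joined by their gap."""
    tree: Tree = {}
    ordered = sorted(points)
    for a, b in zip(ordered, ordered[1:]):
        add_edge(tree, a, b, b - a)
    return tree


def star_tree(points: Sequence[float], center: Hashable = "center") -> Tree:
    """Discrete metric as a star: every point is 1/2 away from an auxiliary center."""
    tree: Tree = {}
    for p in points:
        add_edge(tree, center, p, 0.5)
    return tree


# Hypothetical distinct values v_1, ..., v_n.
points = [1.0, 2.5, 4.0, 9.0]
line = path_tree(points)
star = star_tree(points)

print(tree_distance(line, 1.0, 9.0))  # 8.0, equal to |1.0 - 9.0|
print(tree_distance(star, 1.0, 9.0))  # 1.0, since the two values differ
```

The auxiliary center in the star is not itself a database value; it only serves to make every pair of distinct points sit at distance 1.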

Theorems & Definitions (19)

  • Example 1
  • Example 2
  • Example 3
  • Example 4
  • Example 5
  • Theorem 6
  • Corollary 7
  • Lemma 8
  • Theorem 9
  • Lemma 10 (Fakcharoenphol, Rao, and Talwar, 2004)
  • ...and 9 more