Cleaning data with Swipe

Toon Boeckling; Antoon Bronselaer

TL;DR

This work tackles the repair of databases that violate functional dependencies (FDs), using value updates that minimize the total change cost, a problem known to be NP-hard. It introduces Swipe, a single-path Chase variant that uses a forward repairable partition of attributes and a priority repair strategy to repair FDs sequentially, supported by efficient tuple-equivalence tracking via disjoint-set forests. Swipe guarantees termination and leverages preservative repair functions to avoid revising earlier decisions, achieving competitive repair quality. Empirical evaluation on four real datasets shows that Swipe is one to three orders of magnitude faster than multi-sequence Llunatic while maintaining or improving repair effectiveness, and that it scales well with increasing tuple counts. These results demonstrate that a carefully designed single-path repair can rival exhaustive approaches, and they open avenues for extending the approach to more expressive constraints such as conditional FDs.
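To make the tuple-equivalence idea concrete, the following is a minimal sketch of a disjoint-set forest that groups tuples agreeing on the left-hand side of an FD. It is not the authors' implementation: the class `DisjointSet`, the helper `equivalence_classes`, and the sample `zip`/`city` rows are all hypothetical names and data chosen for illustration.

```python
class DisjointSet:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Locate the root, then point every node on the path at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Attach the smaller tree below the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


def equivalence_classes(rows, lhs):
    """Group row indices that agree on every attribute in lhs (hypothetical helper)."""
    ds = DisjointSet(len(rows))
    first_seen = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        if key in first_seen:
            ds.union(first_seen[key], i)
        else:
            first_seen[key] = i
    classes = {}
    for i in range(len(rows)):
        classes.setdefault(ds.find(i), []).append(i)
    return sorted(classes.values())


# Illustrative data only; not taken from the paper's datasets.
rows = [
    {"zip": "35150", "city": "Sylacauga"},
    {"zip": "35150", "city": "Sylacuaga"},
    {"zip": "36049", "city": "Luverne"},
]
classes = equivalence_classes(rows, ["zip"])
```

Here rows 0 and 1 end up in one class because they share a `zip` value, so a repair for an FD with `zip` on the left-hand side would have to make them agree on the right-hand side.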

Abstract

The repair problem for functional dependencies is the problem where an input database needs to be modified such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is then called an optimal repair. If the allowed modifications are value updates, finding an optimal repair is NP-hard. A well-known approach to find approximations of optimal repairs builds a Chase tree in which each internal node resolves violations of one functional dependency and leaf nodes represent repairs. A key property of this approach is that controlling the branching factor of the Chase tree makes it possible to control the trade-off between repair quality and computational efficiency. In this paper, we explore an extreme variant of this idea in which the Chase tree has only one path. To construct this path, we first create a partition of attributes such that classes can be repaired sequentially. We repair each class only once and do so by fixing the order in which dependencies are repaired. This principle is called priority repairing and we provide a simple heuristic to determine priority. The techniques for attribute partitioning and priority repair are combined in the Swipe algorithm. An empirical study on four real-life data sets shows that Swipe is one to three orders of magnitude faster than multi-sequence Chase-based approaches, whereas the quality of repairs is comparable or better. Moreover, a scalability analysis of the Swipe algorithm shows that Swipe scales well in terms of an increasing number of tuples.
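As a small illustration of the repair problem stated above, the sketch below detects violations of a single FD and measures how far a candidate repair is from the original data. The function names `fd_violations` and `update_cost` are hypothetical, and counting changed cells is only one assumed choice of difference measure; the paper may use a different cost model.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return LHS values that map to more than one RHS value (hypothetical helper)."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(tuple(row[a] for a in rhs))
    return {k: v for k, v in groups.items() if len(v) > 1}


def update_cost(original, repaired):
    """Count changed cells; an assumed cost model, not necessarily the paper's."""
    return sum(o[a] != r[a] for o, r in zip(original, repaired) for a in o)


# Illustrative data only.
original = [
    {"zip": "35150", "city": "Sylacauga"},
    {"zip": "35150", "city": "Sylacuaga"},
    {"zip": "36049", "city": "Luverne"},
]
# A candidate repair obtained by a value update on row 1.
repaired = [dict(r) for r in original]
repaired[1]["city"] = "Sylacauga"

before = fd_violations(original, ["zip"], ["city"])
after = fd_violations(repaired, ["zip"], ["city"])
cost = update_cost(original, repaired)
```

In this toy case the FD `zip -> city` is violated by one group before the update and by none afterwards, at a cost of one changed cell.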

Paper Structure (28 sections, 4 theorems, 18 equations, 3 figures, 5 tables, 4 algorithms)

Introduction
The Llunatic Chase algorithm
Single-path Chase trees with Swipe
Related work
Preliminaries
Sequential repairing
Basic definitions
Attribute partition building
Priority repair
Priority model
Tuple equivalence
Fixing violations with repair functions
Priority repair
The Swipe algorithm
Experimental evaluation
...and 13 more sections

Key Result

Proposition 1

For a schema $\mathcal{R}$ and FDs $\Phi$ defined over $\mathcal{R}$, we have $\left(\mathcal{P} \rightsquigarrow_{F} \Phi\right) \Rightarrow \left(\mathcal{P} \rightsquigarrow \Phi\right)$.

Figures (3)

Figure 1: An example cleaning scenario with hospital data [Xu2013] (top) and seven FDs (middle right). Actual errors in the data are marked in grey. A partition of attributes is shown (middle left) over which FDs are forward repairable. The first partition class shows a priority model over its attributes. For this partition, a repair obtained by using majority voting with random tie breaking as repair function is shown (bottom). Correct changes are shown in green bold font and incorrect changes are shown in red bold font.
Figure 2: The construction of a preorder $P^+$ for the FDs from Figure 1 (middle). Elements derived directly from the FDs are marked with 'x' and elements added to compute the transitive closure are marked with '*'. The preorder is the union of two weak orders, marked by bold lines.
Figure 3: Mean run time (ms) of $10$ executions of Swipe as a function of $|R|$ (left) and $|\mathcal{R}|$ (right).
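The caption of Figure 1 mentions majority voting with random tie breaking as the repair function. A minimal sketch of such a function is given below, assuming that a repair simply overwrites one attribute for all tuples in an equivalence class. The names `majority_vote` and `repair_class` and the sample rows are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def majority_vote(values, rng=random):
    """Pick the most frequent value; break ties uniformly at random."""
    counts = Counter(values)
    top = max(counts.values())
    # Sort the tied candidates so a seeded rng gives reproducible choices.
    candidates = sorted(v for v, c in counts.items() if c == top)
    return rng.choice(candidates)


def repair_class(rows, indices, attribute, rng=random):
    """Overwrite `attribute` in the given rows with the voted value (assumed behaviour)."""
    winner = majority_vote([rows[i][attribute] for i in indices], rng)
    for i in indices:
        rows[i][attribute] = winner
    return winner


# Illustrative data only.
rows = [
    {"zip": "35150", "city": "Sylacauga"},
    {"zip": "35150", "city": "Sylacauga"},
    {"zip": "35150", "city": "Sylacuaga"},
]
winner = repair_class(rows, [0, 1, 2], "city", random.Random(0))
```

Passing a seeded `random.Random` instance keeps tie breaking reproducible across runs, which is convenient when comparing repairs.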

Theorems & Definitions (16)

Example 1
Definition 1
Proposition 1
Example 2
Theorem 1
Example 3
Example 4
Example 5
Example 6
Example 7
...and 6 more
