Cleaning data with Swipe
Toon Boeckling, Antoon Bronselaer
TL;DR
This work addresses the repair of databases under functional dependencies (FDs) by means of value updates that minimize the total change cost, a problem known to be NP-hard. It introduces Swipe, a single-path Chase variant that uses a forward repairable partition of attributes and a priority repair strategy to repair FDs sequentially, supported by efficient tuple-equivalence tracking via disjoint-set forests. Swipe guarantees termination and relies on preservative repair functions to avoid revising earlier decisions, while achieving competitive repair quality. An empirical evaluation on four real datasets shows that Swipe is one to three orders of magnitude faster than multi-sequence Llunatic while maintaining or improving repair effectiveness, and that it scales well with increasing tuple counts. These results demonstrate that a carefully designed single-path repair can rival exhaustive approaches and open avenues for extending the approach to more expressive constraints such as conditional FDs.
Abstract
The repair problem for functional dependencies is the problem of modifying an input database such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is then called an optimal repair. If the allowed modifications are value updates, finding an optimal repair is NP-hard. A well-known approach to finding approximations of optimal repairs builds a Chase tree in which each internal node resolves violations of one functional dependency and leaf nodes represent repairs. A key property of this approach is that controlling the branching factor of the Chase tree makes it possible to control the trade-off between repair quality and computational efficiency. In this paper, we explore an extreme variant of this idea in which the Chase tree has only one path. To construct this path, we first create a partition of attributes such that classes can be repaired sequentially. We repair each class only once and do so by fixing the order in which dependencies are repaired. This principle is called priority repairing and we provide a simple heuristic to determine priority. The techniques for attribute partitioning and priority repair are combined in the Swipe algorithm. An empirical study on four real-life data sets shows that Swipe is one to three orders of magnitude faster than multi-sequence Chase-based approaches, while the quality of repairs is comparable or better. Moreover, a scalability analysis shows that Swipe scales well as the number of tuples increases.
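To make the idea of repairing dependencies once, in a fixed order, more concrete, the following is a minimal Python sketch. It is not the Swipe algorithm: the majority-vote repair, the cell-count cost, the hand-picked dependency order, and all names (`violates`, `repair_fd`, `repair_in_order`, the toy `rows`) are assumptions made for illustration only. It merely shows that, when each dependency changes only its right-hand side and later dependencies never touch attributes used by earlier ones, a single pass leaves every dependency satisfied.

```python
from collections import Counter, defaultdict


def violates(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs is violated in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Remember the first rhs value seen for each lhs value.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return True
    return False


def repair_fd(rows, lhs, rhs):
    """Repair one FD in place by value updates on rhs.

    Assumption: within each group of tuples that agree on lhs, the most
    frequent rhs value wins. Returns the number of changed cells.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    changed = 0
    for group in groups.values():
        winner, _ = Counter(r[rhs] for r in group).most_common(1)[0]
        for r in group:
            if r[rhs] != winner:
                r[rhs] = winner
                changed += 1
    return changed


def repair_in_order(rows, fds):
    """Repair each FD exactly once, in the given (hypothetical) priority order."""
    return sum(repair_fd(rows, lhs, rhs) for lhs, rhs in fds)


# Toy data (invented for this sketch): zip -> city, then city -> state.
rows = [
    {"zip": "9000", "city": "Ghent", "state": "EF"},
    {"zip": "9000", "city": "Ghent", "state": "EF"},
    {"zip": "9000", "city": "Gent", "state": "OV"},
    {"zip": "1000", "city": "Brussels", "state": "BR"},
]
fds = [(("zip",), "city"), (("city",), "state")]

cost = repair_in_order(rows, fds)
still_violated = [fd for fd in fds if violates(rows, *fd)]
```

In this toy instance the order matters: repairing `zip -> city` first merges the two spellings of the city, after which `city -> state` can be repaired without disturbing the first dependency. Reversing the order would leave a `city -> state` violation after one pass, which hints at why a suitable attribute partition and priority are needed.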
