Analyzing Deviations from Monotonic Trends through Database Repair
Shunit Agmon, Jonathan Gal, Amir Gilad, Ester Livshits, Or Mutay, Brit Youngmann, Benny Kimelfeld
TL;DR
The paper introduces Aggregate Order Dependencies (AODs) to quantify deviations from monotonic trends in relational data and formalizes a largest-monotone-subset repair (C-repair) problem. It develops a general dynamic-programming framework, CardRepair, with polynomial-time instantiations for max, min, count, countd, and median, and pseudo-polynomial solutions for sum and avg, plus a suite of optimizations (Holistic Packing, pruning, and heuristics). An efficient heuristic, HeurRepair, provides fast approximate repairs and useful bounds for pruning the exact DP search; experiments on real and synthetic datasets show substantial speedups and generally competitive repair quality. The work demonstrates practical applicability for diagnosing and explaining trend violations, compares against outlier-based methods, and highlights opportunities for extending the framework to multiple AODs, joins, and alternative deletion strategies. Overall, AOD repair offers a principled, data-centric way to measure and explain monotonicity violations in large datasets with diverse aggregation semantics.
Abstract
Datasets often exhibit violations of expected monotonic trends, such as higher education level correlating with higher average salary, newer homes being more expensive, or diabetes prevalence increasing with age. We address the problem of quantifying how far a dataset deviates from such trends. To this end, we introduce Aggregate Order Dependencies (AODs), an aggregation-centric extension of the previously studied order dependencies. An AOD specifies that the aggregated value of a target attribute (e.g., mean salary) should monotonically increase or decrease with the grouping attribute (e.g., education level). We formulate the AOD repair problem as finding the smallest set of tuples to delete from a table so that the given AOD is satisfied. We analyze the computational complexity of this problem and propose a general algorithmic template for solving it. We instantiate the template for common aggregation functions, introduce optimization techniques that substantially improve the runtime of the template instances, and develop efficient heuristic alternatives. Our experimental study, carried out on both real-world and synthetic datasets, demonstrates the practical efficiency of the algorithms and provides insight into the performance of the heuristics. We also present case studies that uncover and explain unexpected AOD violations using our framework.
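
To make the definitions above concrete, the following is a minimal sketch of an AOD check and of the repair objective on a toy table. It is not the paper's CardRepair or HeurRepair algorithm: it is an exponential brute-force search intended only to illustrate what "smallest set of tuples to delete" means. The function names `satisfies_aod` and `brute_force_repair`, the toy data, the choice of a non-decreasing (rather than strictly increasing) trend, and the convention that a group emptied by deletions is simply ignored are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def satisfies_aod(rows, agg=mean):
    """Check a (hypothetical) increasing AOD on rows of (group_key, value).

    Assumption: the trend is non-decreasing, and groups with no remaining
    tuples are ignored.
    """
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    # Aggregate each group, ordered by the grouping attribute.
    aggregated = [agg(groups[key]) for key in sorted(groups)]
    return all(a <= b for a, b in zip(aggregated, aggregated[1:]))


def brute_force_repair(rows, agg=mean):
    """Return a smallest set of row indices whose deletion satisfies the AOD.

    Illustrative only: tries deletion sets in order of increasing size,
    so the running time is exponential in the number of rows.
    """
    for size in range(len(rows) + 1):
        for removed in combinations(range(len(rows)), size):
            kept = [row for i, row in enumerate(rows) if i not in removed]
            if satisfies_aod(kept, agg):
                return set(removed)


# Toy table: (education level, salary in thousands); values are made up.
rows = [(1, 30), (1, 50), (2, 35), (2, 37), (3, 60)]

print(satisfies_aod(rows))       # level 1 averages 40, level 2 averages 36
print(brute_force_repair(rows))  # indices of tuples to delete
```

In this toy table, the mean salary drops from level 1 to level 2, so the AOD is violated; deleting the single tuple `(1, 50)` restores the trend. The size of the returned deletion set is the kind of quantity the abstract proposes as a measure of how far the data deviates from the trend.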
