Causal Search for Skylines (CSS): Causally-Informed Selective Data De-Correlation

Pratanu Mandal, Abhinav Gorantla, K. Selçuk Candan, Maria Luisa Sapino

Abstract

Skyline queries are popular and effective tools in multi-criteria decision support, as they extract interesting (Pareto-optimal) points that help summarize the available data with respect to a given set of preference attributes. Unfortunately, the efficiency of skyline algorithms depends heavily on the underlying data statistics. In this paper, we argue that the efficiency of skyline algorithms could be significantly boosted if one could erase any attribute correlations that do not agree with the preference criteria, while preserving (or even boosting) correlations that agree with the user-provided criteria. Therefore, we propose a causally-informed selective de-correlation mechanism that enables skyline algorithms to better leverage the pruning opportunities provided by positively-aligned data distributions, without suffering from the mis-alignments. In particular, we show that, given a causal graph that describes the underlying causal structure of the data, one can identify a subset of the attributes that can be used to selectively de-correlate the preference attributes. Importantly, the proposed causal search for skylines (CSS) approach is agnostic to the underlying candidate enumeration and pruning strategies and, therefore, can be leveraged to improve any popular skyline discovery algorithm. Experiments on multiple real and synthetic data sets and for different skyline discovery algorithms show that the proposed causally-informed selective de-correlation technique significantly reduces both the number of dominance checks and the overall time needed to locate skyline points.
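
To make the terms in the abstract concrete, the following is a minimal Python sketch, not the paper's CSS algorithm. It shows a dominance test, a simple nested-loop skyline that counts dominance checks, and a partition-then-merge variant that computes a local skyline per bucket of a conditioning attribute and then takes the skyline of the merged local results. Everything here is an illustrative assumption: smaller values are preferred on every attribute, the conditioning attribute `z` and its ten equal-width buckets are hand-picked rather than derived from a causal graph, and the names `dominates`, `skyline`, and `partitioned_skyline` are hypothetical.

```python
import random


def dominates(p, q):
    """p dominates q: no worse on every attribute, strictly better on at least one.
    Assumption: smaller values are preferred on all attributes."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def checked_dominates(p, q, counter):
    """Same test as dominates(), but increments counter[0] for each check."""
    counter[0] += 1
    return dominates(p, q)


def skyline(points, counter):
    """Simple nested-loop skyline over a list of tuples."""
    window = []
    for p in points:
        # Skip p if a current candidate dominates it.
        if any(checked_dominates(q, p, counter) for q in window):
            continue
        # Otherwise drop the candidates that p dominates and keep p.
        window = [q for q in window if not checked_dominates(p, q, counter)]
        window.append(p)
    return window


def partitioned_skyline(rows, bucket_of, counter):
    """rows: iterable of (z, point). Group points by bucket_of(z), compute a
    local skyline per group, then take the skyline of the merged local results."""
    groups = {}
    for z, point in rows:
        groups.setdefault(bucket_of(z), []).append(point)
    merged = [p for group in groups.values() for p in skyline(group, counter)]
    return skyline(merged, counter)


# Synthetic data (an assumption for illustration): z pushes x up and y down,
# so x and y are negatively correlated overall but less so within a z-bucket.
rng = random.Random(0)
rows = []
for _ in range(2000):
    z = rng.random()
    x = z + 0.05 * rng.random()
    y = (1.0 - z) + 0.05 * rng.random()
    rows.append((z, (x, y)))

plain_checks, split_checks = [0], [0]
plain = skyline([p for _, p in rows], plain_checks)
split = partitioned_skyline(rows, lambda z: int(z * 10), split_checks)

print("skyline size:", len(plain))
print("dominance checks, single pass:", plain_checks[0])
print("dominance checks, partition then merge:", split_checks[0])
```

Both routes return the same skyline, because any globally non-dominated point is also non-dominated within its own bucket. Whether the partitioned route needs fewer dominance checks depends on the data and on the choice of conditioning attribute, which is the question the paper's causal analysis addresses.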

Paper Structure (62 sections, 61 equations, 27 figures, 3 tables, 2 algorithms)

Figures (27)

  • Figure 1: Running example: skylines for house hunting
  • Figure 2: Impact of the alignment between data distribution and preference criteria: in (a) the dominance region is determined by a few extreme points, whereas in (b) there are a large number of skyline points on the Pareto front
  • Figure 3: Impact of clustering on the (a) confounder $Z$: (b) within each cluster, the correlation between $X$ and $Y$ is close to zero, and (c) seeking the skyline among the merged Pareto fronts (orange dots) across clusters is much less expensive, as many tuples have already been pruned
  • Figure 4: Three basic causal structures
  • Figure 5: (a,b) The fork attribute $A$ imposes negative correlation on attributes $X$ and $Y$; (c,d) conditioning on the attribute $A$ creates multiple clusters of data points, each cluster lacking any negative correlation; in (e), we highlight two of these resulting clusters in red and green
  • ...and 22 more figures

Theorems & Definitions (11)

  • Example 1: House hunting
  • Definition 1: Causal Graph
  • Example 2: Weakening or Strengthening of Relationships among Preference Attributes
  • Definition 2: Conditioning
  • Example 3: Conditioning to Weaken Negative Correlations among the Preference Attributes
  • Example 4: Leaky Conditioning/Blocking
  • Definition 3: Tuple dominance ($dom$)
  • Definition 4: Skyline
  • Example 5: Causal graph for house hunting
  • Definition 5: Causal Path
  • ...and 1 more