Finding Non-Redundant Simpson's Paradox from Multidimensional Data
Yi Yang, Jian Pei, Jun Yang, Jichun Xie
TL;DR
This work tackles Simpson's paradox in multidimensional data by showing that many paradox instances are redundant and by introducing an equivalence-based framework that isolates the non-redundant ones. It formalizes three redundancy types (sibling-child, separator, and statistic equivalence) and proves that redundancy forms an equivalence relation, which enables a concise representation via convex coverage groups. The authors develop a pipeline that combines depth-first (DFS) materialization of the base table with redundancy-aware paradox discovery; it scales to millions of records and reveals paradox structures that remain robust under data perturbation. Empirical results on real and synthetic data show that redundant paradoxes are widespread (over 40% of all paradoxes on some real datasets), that the approach yields significant speedups over brute-force methods, and that the discovered patterns are stable, underscoring its practical utility for high-dimensional data analysis and causal inference.
Abstract
Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables, which obscure essential patterns and inflate computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy (sibling-child, separator, and statistic equivalence) and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing redundant paradoxes and design efficient algorithms that integrate depth-first materialization of the base table with redundancy-aware paradox discovery. Experiments on real-world datasets and synthetic benchmarks show that redundant paradoxes are widespread, constituting over 40% of all paradoxes on some real datasets, while our algorithms scale to millions of records, reduce run time by up to 60%, and discover paradoxes that are structurally robust under data perturbation. These results demonstrate that Simpson's paradoxes can be efficiently identified, concisely summarized, and meaningfully interpreted in large multidimensional datasets.
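To make the reversal described above concrete, the following is a minimal sketch of a single paradox check on a tiny aggregated table. It is not the paper's algorithm: the counts are toy values patterned on the classic kidney-stone treatment example, and the names `ROWS`, `rate`, `sign`, and `is_simpson_paradox` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Toy records (assumed for illustration): (treatment, subgroup, successes, trials).
ROWS = [
    ("A", "small", 81, 87),
    ("A", "large", 192, 263),
    ("B", "small", 234, 270),
    ("B", "large", 55, 80),
]


def rate(rows):
    """Pooled success rate over a set of records (assumes at least one trial)."""
    successes = sum(r[2] for r in rows)
    trials = sum(r[3] for r in rows)
    return successes / trials


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_simpson_paradox(rows, first="A", second="B"):
    """True if all subgroups agree on a direction that the aggregate reverses.

    Assumes every subgroup contains records for both treatments.
    """
    def diff(subset):
        a = [r for r in subset if r[0] == first]
        b = [r for r in subset if r[0] == second]
        return rate(a) - rate(b)

    # Partition the records by the separator attribute (here: the subgroup label).
    groups = defaultdict(list)
    for r in rows:
        groups[r[1]].append(r)

    overall = sign(diff(rows))
    subgroup_signs = {sign(diff(g)) for g in groups.values()}
    # Paradox: one consistent subgroup direction, opposite to the overall one.
    return overall != 0 and subgroup_signs == {-overall}


if __name__ == "__main__":
    print(is_simpson_paradox(ROWS))
```

In this toy table, treatment A has the higher success rate within each subgroup, while treatment B has the higher rate once the subgroups are pooled, so the check reports a paradox. Finding all such instances across many attribute combinations, and recognizing which of them are redundant, is the problem the paper addresses.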
