Finding Non-Redundant Simpson's Paradox from Multidimensional Data

Yi Yang, Jian Pei, Jun Yang, Jichun Xie

TL;DR

This work tackles Simpson's paradox in multidimensional data by exposing widespread redundancy among paradox instances and introducing an equivalence-based framework that isolates the non-redundant ones. It formalizes three types of redundancy (sibling child, separator, and statistic equivalence) and proves that redundancy is an equivalence relation, which enables a concise representation via convex coverage groups. The authors develop algorithms that combine depth-first materialization of the base table with redundancy-aware paradox discovery and scale to millions of records. Experiments on real and synthetic data show that redundant paradoxes are widespread (over 40% of all paradoxes on some real datasets), that the algorithms achieve speedups over brute-force methods with run time reduced by up to 60%, and that the discovered paradoxes are structurally robust under data perturbation, which supports their practical use in high-dimensional data analysis and causal inference.

Abstract

Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables, which obscure essential patterns and inflate computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy (sibling child, separator, and statistic equivalence) and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing redundant paradoxes and design efficient algorithms that integrate depth-first materialization of the base table with redundancy-aware paradox discovery. Experiments on real-world datasets and synthetic benchmarks show that redundant paradoxes are widespread, on some real datasets constituting over 40% of all paradoxes, while our algorithms scale to millions of records, reduce run time by up to 60%, and discover paradoxes that are structurally robust under data perturbation. These results demonstrate that Simpson's paradoxes can be efficiently identified, concisely summarized, and meaningfully interpreted in large multidimensional datasets.
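
The reversal described in the abstract can be made concrete with a small check. The Python sketch below is illustrative only and is not the paper's algorithm. It assumes a binary outcome, two populations to compare, and a single separator attribute, and it requires a strict reversal in every sub-population. The counts are the widely cited kidney-stone illustration of the paradox, not data from the paper, and the names `COUNTS`, `success_rate`, and `is_simpson_reversal` are hypothetical.

```python
# (population, sub-population) -> (successes, trials).
# Illustrative counts from the well-known kidney-stone example (assumption:
# they are used here only as a toy input, not as data from the paper).
COUNTS = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}


def success_rate(counts, group, subgroup=None):
    """Success rate of `group`, optionally restricted to one sub-population."""
    cells = [
        cell
        for (g, s), cell in counts.items()
        if g == group and (subgroup is None or s == subgroup)
    ]
    successes = sum(k for k, _ in cells)
    trials = sum(n for _, n in cells)
    return successes / trials


def is_simpson_reversal(counts, g1, g2):
    """True if the aggregate comparison of g1 and g2 flips in every sub-population."""
    subgroups = sorted({s for _, s in counts})
    aggregate = success_rate(counts, g1) - success_rate(counts, g2)
    within = [
        success_rate(counts, g1, s) - success_rate(counts, g2, s)
        for s in subgroups
    ]
    if aggregate > 0:
        return all(d < 0 for d in within)
    if aggregate < 0:
        return all(d > 0 for d in within)
    return False  # no aggregate association, so nothing to reverse


print(is_simpson_reversal(COUNTS, "A", "B"))  # True
```

Here population A has the lower aggregate success rate (273/350 versus 289/350) but the higher rate within both the small and the large sub-population, which is exactly the reversal pattern.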

Paper Structure

This paper contains 45 sections, 11 theorems, 14 equations, 7 figures, 4 tables, 7 algorithms.

Key Result

lemma 1

Consider two association configurations $p = (s_1, s_2, X, Y)$ and $p' = (s'_1, s'_2, X, Y)$ where $\operatorname{\mathsf{cov}}(s_1) = \operatorname{\mathsf{cov}}(s'_1)$ and $\operatorname{\mathsf{cov}}(s_2) = \operatorname{\mathsf{cov}}(s'_2)$. If $p$ is a Simpson's paradox, then $p'$ is also a Simpson's paradox.
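
The intuition behind the lemma can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation: it assumes that $\operatorname{\mathsf{cov}}(s)$ denotes the set of records matching the population $s$, and that a population is described by attribute-value constraints. The toy table `ROWS`, the attribute names, and the helpers `cov` and `outcome_rates` are all hypothetical.

```python
# Toy base table (hypothetical): every "basic" and "premium" record is in "west".
ROWS = [
    {"plan": "basic", "region": "west", "tier": "new", "churn": 1},
    {"plan": "basic", "region": "west", "tier": "old", "churn": 0},
    {"plan": "basic", "region": "west", "tier": "new", "churn": 0},
    {"plan": "premium", "region": "west", "tier": "new", "churn": 1},
    {"plan": "premium", "region": "west", "tier": "old", "churn": 1},
    {"plan": "trial", "region": "east", "tier": "new", "churn": 0},
    {"plan": "trial", "region": "west", "tier": "old", "churn": 1},
]


def cov(rows, population):
    """Indices of the rows that satisfy every attribute=value constraint."""
    return frozenset(
        i
        for i, row in enumerate(rows)
        if all(row[attr] == value for attr, value in population.items())
    )


def outcome_rates(rows, covered, separator, label):
    """Aggregate label rate and per-separator-value label rates over `covered`."""
    by_value = {}
    for i in covered:
        by_value.setdefault(rows[i][separator], []).append(rows[i][label])
    overall = sum(rows[i][label] for i in covered) / len(covered)
    return overall, {v: sum(ys) / len(ys) for v, ys in by_value.items()}


# Two sibling populations and two more specific siblings with the same coverage.
S1, S2 = {"plan": "basic"}, {"plan": "premium"}
S1_CHILD = {"plan": "basic", "region": "west"}
S2_CHILD = {"plan": "premium", "region": "west"}

same_coverage = (
    cov(ROWS, S1) == cov(ROWS, S1_CHILD) and cov(ROWS, S2) == cov(ROWS, S2_CHILD)
)
same_statistics = outcome_rates(ROWS, cov(ROWS, S1), "tier", "churn") == outcome_rates(
    ROWS, cov(ROWS, S1_CHILD), "tier", "churn"
)
print(same_coverage, same_statistics)
```

Because every rate is computed from the covered records alone, two configurations whose populations cover the same records yield identical aggregate and per-sub-population statistics, so they agree on whether a paradox is present.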

Figures (7)

  • Figure 1: Hasse diagram of the lattice formed by all populations in the example table (tab:ex1) with respect to the parent-child relation $\mathbin{\dot\succ}$. A parent is placed lower than its child. The blue and green subsets are convex, while the orange subset is non-convex.
  • Figure 2: Distribution of the number of Simpson's paradoxes per redundant group in four real-world datasets.
  • Figure 3: Effect of dataset parameters on the total number of Simpson's paradoxes (orange) and redundant paradox groups (blue) in synthetic data.
  • Figure 4: (a) Run time comparison on real-world datasets. Yellow shaded regions represent materialization time. (b) Run time vs. pruning threshold on real-world datasets.
  • Figure 5: Run time scaling with synthetic dataset parameters. Solid lines denote total run time; dotted lines denote materialization time.
  • ...and 2 more figures

Theorems & Definitions (15)

  • definition 1
  • lemma 1: Sibling child equivalence
  • lemma 2: Separator equivalence
  • lemma 3: Statistic equivalence
  • definition 2: Redundancy
  • theorem 1: Equivalence
  • lemma 4: Product Space
  • definition 3: Concise representation
  • theorem 2: #P-Hardness
  • theorem 3: Completeness
  • ...and 5 more
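
Theorem 1 states that redundancy is an equivalence relation, which means the discovered paradoxes split into disjoint groups of mutually redundant instances. The Python sketch below shows one generic way to form such groups from pairwise redundancy facts using union-find. It is an assumption-laden illustration rather than the paper's concise-representation algorithm: the paradox identifiers and the function `group_by_equivalence` are hypothetical, and the input pairs are taken as given.

```python
def group_by_equivalence(items, redundant_pairs):
    """Partition `items` into classes given pairs known to be redundant.

    Reflexivity, symmetry, and transitivity are what make the union-find
    merge valid: if a ~ b and b ~ c, all three end up in one class.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in redundant_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(group) for group in groups.values())


paradoxes = ["p1", "p2", "p3", "p4", "p5"]
pairs = [("p1", "p2"), ("p2", "p3")]
print(group_by_equivalence(paradoxes, pairs))
# [['p1', 'p2', 'p3'], ['p4'], ['p5']]
```

Reporting one representative per group, rather than every member, is the basic idea that a concise representation builds on.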