Finding Non-Redundant Simpson's Paradox from Multidimensional Data

Yi Yang; Jian Pei; Jun Yang; Jichun Xie

TL;DR

This work addresses Simpson's paradox in multidimensional data by showing that many paradox instances are redundant and by introducing an equivalence-based framework that isolates the non-redundant ones. It formalizes three types of redundancy (sibling-child, separator, and statistic equivalence) and proves that redundancy forms an equivalence relation, which enables a concise representation via convex coverage groups. The authors develop algorithms that combine depth-first materialization of the base table with redundancy-aware paradox discovery and that scale to millions of records. Experiments on real and synthetic data show that redundant paradoxes are widespread (over 40% of all paradoxes on some real datasets), that the algorithms achieve speedups over brute-force methods, with run time reduced by up to 60%, and that the discovered paradoxes are structurally robust under data perturbation. These results underscore the framework's practical utility for high-dimensional data analysis and causal inference.

Abstract

Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables, which obscure essential patterns and inflate computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy (sibling-child, separator, and statistic equivalence) and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing redundant paradoxes and design efficient algorithms that integrate depth-first materialization of the base table with redundancy-aware paradox discovery. Experiments on real-world datasets and synthetic benchmarks show that redundant paradoxes are widespread, constituting over 40% of all paradoxes on some real datasets, while our algorithms scale to millions of records, reduce run time by up to 60%, and discover paradoxes that are structurally robust under data perturbation. These results demonstrate that Simpson's paradoxes can be efficiently identified, concisely summarized, and meaningfully interpreted in large multidimensional datasets.
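
To make the reversal described in the abstract concrete, the following is a minimal sketch of the textbook form of Simpson's paradox: one option has the higher success rate in every sub-population yet the lower rate once the sub-populations are pooled. It is not the paper's formal definition or algorithm. The counts are illustrative toy values patterned on the classic kidney-stone example, and the names `COUNTS`, `rate`, `overall_rate`, and `is_simpson_reversal` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Illustrative toy counts (an assumption, not data from the paper):
# (successes, trials) for each treatment within each sub-population.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Exact success rate, kept as a fraction to avoid rounding issues."""
    return Fraction(successes, trials)


def overall_rate(groups):
    """Success rate after pooling all sub-populations of one treatment."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(n for _, n in groups.values())
    return rate(total_successes, total_trials)


def is_simpson_reversal(counts, first, second):
    """Return True if the sign of (first - second) is the same in every
    sub-population but flips in the pooled data."""
    subgroup_diffs = [
        rate(*counts[first][g]) - rate(*counts[second][g])
        for g in counts[first]
    ]
    pooled_diff = overall_rate(counts[first]) - overall_rate(counts[second])
    all_positive = all(d > 0 for d in subgroup_diffs)
    all_negative = all(d < 0 for d in subgroup_diffs)
    return (all_positive and pooled_diff < 0) or (all_negative and pooled_diff > 0)


if __name__ == "__main__":
    for g in COUNTS["A"]:
        print(g, float(rate(*COUNTS["A"][g])), float(rate(*COUNTS["B"][g])))
    print("pooled", float(overall_rate(COUNTS["A"])), float(overall_rate(COUNTS["B"])))
    print("reversal:", is_simpson_reversal(COUNTS, "A", "B"))
```

In a multidimensional table, the same check could be repeated for many choices of sub-population, partitioning attribute, and outcome statistic, which is where many near-identical findings can arise and why the abstract's notion of redundancy matters.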
