Resolution of Simpson's paradox via the common cause principle
A. Hovhannisyan, A. E. Allahverdyan
TL;DR
This work analyzes Simpson's paradox through the lens of the common-cause principle, showing that a minimal, possibly unobserved, common cause C can resolve the apparent reversal between aggregated and stratified associations in both discrete and continuous settings. For binary A_1, A_2, B, and C, conditioning on any binary common cause C yields the same direction of the A_1–A_2 association as conditioning on B (Theorem 1), so the fine-grained (B-conditioned) view is preferred over the coarse (B-marginalized) view. In the Gaussian setting, the minimal formulation with three scalar variables A_1, A_2, and B leads to the same conclusion: the option with conditioning on B should be chosen (Theorem 2). The paper illustrates these ideas with examples (smoking and survival; COVID-19 in Italy vs. China) and discusses the limitations when C is non-binary, where multiple paradoxical outcomes can arise; it also connects the framework to unsupervised learning concepts and causal inference principles. Overall, the results underscore the importance of explicitly modeling or inferring a minimal common cause in order to interpret probabilistic associations correctly in the presence of Simpson's paradox.
Abstract
Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$ in the presence of a third (lurking) random variable $B$. We focus on scenarios in which the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. In such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This setup generalizes the original Simpson's paradox: its two contradictory options now refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and most widespread situation for Simpson's paradox), conditioning over any binary common cause $C$ establishes the same direction of association between $a_1$ and $a_2$ as conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ rather than its marginalization. The same conclusion is reached when Simpson's paradox is formulated via three continuous Gaussian variables: within the minimal formulation of the paradox (three scalar continuous variables $A_1$, $A_2$, and $B$), one should choose the option with conditioning over $B$.
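To make the two contradictory options concrete, the following minimal Python sketch compares the direction of association between $a_1$ and $a_2$ under marginalization over $B$ and under conditioning over $B$. The counts are hypothetical and chosen only to exhibit the reversal; they are not taken from the paper, and the helper names `p_a1_given` and `sign` are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical counts (illustration only, not data from the paper).
# counts[(b, a2)] = (number of cases where a1 occurs, total number of cases)
counts = {
    (0, 1): (90, 100),
    (0, 0): (340, 400),
    (1, 1): (120, 400),
    (1, 0): (25, 100),
}

def p_a1_given(a2, b=None):
    """Return p(a1 | a2) if b is None (B marginalized), else p(a1 | a2, b)."""
    keys = [(bb, aa) for (bb, aa) in counts
            if aa == a2 and (b is None or bb == b)]
    hits = sum(counts[k][0] for k in keys)
    total = sum(counts[k][1] for k in keys)
    return Fraction(hits, total)

def sign(x):
    """Sign of x as -1, 0, or +1."""
    return (x > 0) - (x < 0)

# Direction of association after marginalizing over B.
marginal_sign = sign(p_a1_given(1) - p_a1_given(0))

# Direction of association within each value of B.
conditional_signs = [sign(p_a1_given(1, b) - p_a1_given(0, b)) for b in (0, 1)]

print(marginal_sign)      # -1: a2 appears to lower the probability of a1
print(conditional_signs)  # [1, 1]: a2 raises the probability of a1 for each b
```

With these counts, the aggregate comparison gives $p(a_1|a_2) = 0.42 < 0.73 = p(a_1|\bar a_2)$, while within each value of $B$ the inequality is reversed. According to the abstract, for a minimal (binary) common cause $C$ the conditional direction is the one to adopt.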
