Resolution of Simpson's paradox via the common cause principle

A. Hovhannisyan, A. E. Allahverdyan

TL;DR

This work analyzes Simpson's paradox through the common cause principle. It assumes that $A=(A_1,A_2)$ and $B$ share a common cause $C$, possibly unobserved, and defines the correct association between $A_1$ and $A_2$ by conditioning on $C$. In the minimal discrete setting (binary $A_1$, $A_2$, $B$, and $C$), conditioning on any binary common cause $C$ gives the same direction of the association as conditioning on $B$ (Theorem 1), so the fine-grained ($B$-conditioned) option is preferred over the coarse (marginalized) one. In the minimal Gaussian formulation with three scalar variables $A_1$, $A_2$, and $B$, the same conclusion holds: the option with conditioning on $B$ should be chosen (Theorem 2). The paper illustrates these ideas with examples (smoking and survival, COVID-19 in Italy vs. China) and discusses the limitations when $C$ is non-binary, where multiple paradoxical outcomes can arise; it also connects the framework to unsupervised learning concepts and causal inference principles. Overall, the results underscore the importance of explicitly modeling or inferring a minimal common cause when interpreting probabilistic associations in the presence of Simpson's paradox.
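
To make the two competing options concrete, here is a minimal sketch of the discrete paradox in Python. The counts are made up for illustration and are not taken from the paper; the names `counts`, `p_a1_given`, and `direction` are hypothetical helpers. The sketch compares $p(a_1|a_2)$ with $p(a_1|\bar a_2)$ once after marginalizing over $B$ and once within each value of $B$.

```python
from fractions import Fraction

# Illustrative (made-up) counts, not data from the paper.
# counts[b][a2] = (number of cases where a1 occurs, total cases)
counts = {
    0: {1: (90, 100), 0: (800, 1000)},
    1: {1: (300, 1000), 0: (20, 100)},
}

def p_a1_given(a2, b=None):
    """Return p(a1 | A2=a2) if b is None, else p(a1 | A2=a2, B=b)."""
    strata = list(counts) if b is None else [b]
    hits = sum(counts[s][a2][0] for s in strata)
    total = sum(counts[s][a2][1] for s in strata)
    return Fraction(hits, total)

def direction(x, y):
    """+1 if x > y, -1 if x < y, 0 if equal."""
    return (x > y) - (x < y)

# Coarse option: B is marginalized out.
marginal = direction(p_a1_given(1), p_a1_given(0))

# Fine-grained option: condition on each value of B.
conditional = [direction(p_a1_given(1, b), p_a1_given(0, b)) for b in (0, 1)]

print("marginal direction:", marginal)
print("direction within B=0 and B=1:", conditional)
```

With these counts the marginal comparison is negative while both $B$-conditioned comparisons are positive, which is the reversal that the paper's common-cause argument is meant to adjudicate.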

Abstract

Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given the third (lurking) random variable $B$. We focus on scenarios when the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. For such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This setup generalizes the original Simpson's paradox: now its two contradicting options refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and the most widespread situation for the Simpson's paradox), the conditioning over any binary common cause $C$ establishes the same direction of association between $a_1$ and $a_2$ as the conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ and not its marginalization. The same conclusion is reached when Simpson's paradox is formulated via 3 continuous Gaussian variables: within the minimal formulation of the paradox (3 scalar continuous variables $A_1$, $A_2$, and $B$), one should choose the option with the conditioning over $B$.
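
For the Gaussian formulation, the two options can be compared through the sign of the correlation between $A_1$ and $A_2$ before and after conditioning on $B$. The sketch below uses the standard partial-correlation formula for jointly Gaussian variables; the correlation values are illustrative assumptions, not numbers from the paper, and `partial_corr` is a hypothetical helper name.

```python
import math

def partial_corr(r12, r1b, r2b):
    """Correlation of A1 and A2 conditioned on B for jointly Gaussian variables.

    r12, r1b, r2b are the pairwise correlations corr(A1, A2), corr(A1, B),
    and corr(A2, B). For Gaussians the result does not depend on the value of B.
    """
    return (r12 - r1b * r2b) / math.sqrt((1 - r1b**2) * (1 - r2b**2))

# Illustrative (assumed) correlations; they form a valid correlation matrix.
r12, r1b, r2b = 0.2, 0.8, 0.6

cond = partial_corr(r12, r1b, r2b)
print("marginal correlation:", r12)
print("correlation conditioned on B:", cond)
```

Here the marginal correlation is positive while the $B$-conditioned one is negative, so the two options of the paradox disagree in sign; according to the abstract, the conditioned option is the one to choose in the minimal formulation.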

Paper Structure

This paper contains 24 sections, 63 equations, 1 figure.

Figures (1)

  • Figure 1: Directed acyclic graphs between random variables $A=(A_1,A_2)$, $B$ and $C$ involved in discussing Simpson's paradox. The first and second graphs were studied in Refs. [pearl, pearl2]; see (\ref{ex1}, \ref{ex2}). The third or the fourth graph is the basic assumption of this work; see (\ref{gog}). In the first graph, $B$ influences $A_1$ and $A_2$, but $B$ is not the common cause in the strict sense, because there is an influence from $A_2$ to $A_1$. A similar interpretation applies to the second graph. We emphasize that the joint probability $p(A_1,A_2,B)$ for the first and second graphs has the same form, i.e. such graphs are extra constructions employed for the interpretation of data. In contrast, the third and fourth graphs imply a definite (but the same for both graphs) limitation on the joint probability $p(A_1,A_2,B,C)$, which is expressed by (\ref{gog}).
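
The equation referenced as (\ref{gog}) is not reproduced in this summary. Based on the abstract's statement that $C$ screens out $A$ from $B$, the limitation is presumably the usual conditional-independence factorization, sketched here as an assumption rather than as the paper's exact formula:

$$
p(A_1,A_2,B,C) = p(A_1,A_2\mid C)\,p(B\mid C)\,p(C),
\qquad
p(A_1,A_2,B) = \sum_{C} p(A_1,A_2\mid C)\,p(B\mid C)\,p(C).
$$

Under this reading, the association between $a_1$ and $a_2$ would be judged by comparing $p(a_1\mid a_2, C)$ with $p(a_1\mid \bar a_2, C)$ for each value of $C$, instead of comparing the marginal quantities $p(a_1\mid a_2)$ and $p(a_1\mid \bar a_2)$.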