Resolution of Simpson's paradox via the common cause principle

A. Hovhannisyan; A. E. Allahverdyan

TL;DR

This work analyzes Simpson's paradox through the common cause principle. It shows that a minimal, possibly unobserved, common cause $C$ determines which of the two conflicting associations (aggregated or stratified) should be chosen, in both discrete and continuous settings. For binary $A_1$, $A_2$, $B$, and $C$, conditioning on any binary common cause $C$ yields the same direction of the $A_1$–$A_2$ association as conditioning on $B$ (Theorem 1), so the fine-grained ($B$-conditioned) option is preferred over the coarse (marginalized) one. In the Gaussian setting, the minimal formulation with three scalar variables $A_1$, $A_2$, and $B$ likewise selects the option with conditioning on $B$ (Theorem 2). The paper illustrates these ideas with two examples (smoking and survival; COVID-19 in Italy versus China) and discusses the limitations that appear when $C$ is non-binary, where the direction of association is no longer fixed by the common cause alone. It also connects the framework to unsupervised learning and to causal inference. Overall, the results show that interpreting probabilistic associations in the presence of Simpson's paradox requires explicitly modeling or inferring a minimal common cause.
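The binary statement can be checked numerically. The sketch below is a sanity check under stated assumptions, not the authors' proof: the array layout, the function names `cov_sign`, `association_signs`, and `scan`, and the hand-built distribution are all illustrative. It builds a joint distribution that satisfies the screening condition $p(A,B|C)=p(A|C)\,p(B|C)$ with binary $B$ and $C$, then compares the sign of the $A_1$–$A_2$ association in the marginal, given $B$, and given $C$.

```python
import numpy as np

def cov_sign(p):
    """Sign of cov(A1, A2) for p = [p11, p10, p01, p00].

    For binary variables this equals the sign of p(a1|a2) - p(a1|not a2).
    """
    return int(np.sign(p[0] * p[3] - p[1] * p[2]))

def association_signs(pA_c, p_c, pB_c):
    """Association signs under the screening assumption p(A,B|C) = p(A|C) p(B|C).

    pA_c[c]    : p(A | C=c) over (a1 a2) = 11, 10, 01, 00; shape (2, 4)
    p_c[c]     : p(C=c); shape (2,)
    pB_c[c, b] : p(B=b | C=c); shape (2, 2)
    """
    pA_c, p_c, pB_c = map(np.asarray, (pA_c, p_c, pB_c))
    # joint[c, b, a] = p(c) p(b|c) p(a|c)
    joint = p_c[:, None, None] * pB_c[:, :, None] * pA_c[:, None, :]
    pAB = joint.sum(axis=0)   # p(b, a), with C marginalized out
    pA = pAB.sum(axis=0)      # p(a), with B and C marginalized out
    return {
        "marginal": cov_sign(pA),
        "given_B": [cov_sign(pAB[b] / pAB[b].sum()) for b in range(2)],
        "given_C": [cov_sign(pA_c[c]) for c in range(2)],
    }

def scan(n=20000, seed=0):
    """Count random Simpson-paradox cases and cases where C disagrees with B."""
    rng = np.random.default_rng(seed)
    n_paradox = n_violations = 0
    for _ in range(n):
        s = association_signs(rng.dirichlet(np.ones(4), size=2),
                              rng.dirichlet(np.ones(2)),
                              rng.dirichlet(np.ones(2), size=2))
        b0, b1 = s["given_B"]
        # Paradox: both B-strata agree, and the marginal has the opposite sign.
        if b0 == b1 != 0 and s["marginal"] == -b0:
            n_paradox += 1
            n_violations += s["given_C"] != [b0, b0]
    return n_paradox, n_violations

# Hand-built example (hypothetical numbers): within each value of C the
# association is positive, but C shifts A1 and A2 in opposite directions.
example = association_signs(
    pA_c=[[0.18, 0.02, 0.62, 0.18], [0.18, 0.62, 0.02, 0.18]],
    p_c=[0.5, 0.5],
    pB_c=[[0.97, 0.03], [0.03, 0.97]],
)
print(example)  # {'marginal': -1, 'given_B': [1, 1], 'given_C': [1, 1]}
```

In the example the marginal association is negative while both the $B$-conditioned and the $C$-conditioned associations are positive, so the common cause sides with conditioning on $B$. Calling `scan()` repeats the comparison on randomly drawn distributions; its second return value counts paradox cases in which the two conditionings disagree.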

Abstract

Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given a third (lurking) random variable $B$. We focus on scenarios where the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. In such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This setup generalizes the original Simpson's paradox: its two contradicting options now refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and most widespread setting for Simpson's paradox), conditioning over any binary common cause $C$ establishes the same direction of association between $a_1$ and $a_2$ as conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ and not its marginalization. The same conclusion is reached when Simpson's paradox is formulated via three continuous Gaussian variables: within the minimal formulation of the paradox (three scalar continuous variables $A_1$, $A_2$, and $B$), one should choose the option with the conditioning over $B$.
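The Gaussian statement can be illustrated with a short worked calculation. This is a sketch under assumptions that are not taken from the paper: a scalar Gaussian common cause $C$ and a linear parametrization with the hypothetical symbols $\alpha_i$, $\beta$, $v$, $s$, $\sigma^2$, and $\rho$. It is not the authors' derivation.

```latex
% Assumed linear-Gaussian model with screening p(A,B|C) = p(A|C) p(B|C):
%   A_i = \alpha_i C + \varepsilon_i \ (i=1,2), \qquad B = \beta C + \eta,
%   \mathrm{Var}(C) = v, \quad \mathrm{Cov}(\varepsilon_1,\varepsilon_2) = s, \quad \mathrm{Var}(\eta) = \sigma^2,
%   with C, (\varepsilon_1,\varepsilon_2), and \eta mutually independent.
\begin{align}
  \mathrm{Cov}(A_1,A_2)          &= s + \alpha_1\alpha_2 v, \\
  \mathrm{Cov}(A_1,A_2 \mid C)   &= s, \\
  \mathrm{Cov}(A_1,A_2 \mid B)   &= \mathrm{Cov}(A_1,A_2)
      - \frac{\mathrm{Cov}(A_1,B)\,\mathrm{Cov}(A_2,B)}{\mathrm{Var}(B)}
      = s + \alpha_1\alpha_2 v\,(1-\rho), \\
  \rho &= \frac{\beta^2 v}{\beta^2 v + \sigma^2} \in [0,1].
\end{align}
% The function g(x) = s + \alpha_1\alpha_2 v (1 - x) is affine in x:
%   g(0) is the marginal covariance, g(\rho) the covariance given B,
%   and g(1) the covariance given C.
```

If the marginal and the $B$-conditioned covariances have opposite signs, the affine function $g$ changes sign between $0$ and $\rho$, so $g(1)$ has the sign of $g(\rho)$. Under these assumptions, conditioning on the common cause agrees with conditioning on $B$, which matches the conclusion stated in the abstract.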

Paper Structure (24 sections, 63 equations, 1 figure)

Introduction
Formulation of Simpson's paradox and previous works
Formulation of the paradox for binary variables and its necessary conditions
Attempts to resolve the paradox
Replacing prediction with retrodiction
Exchangeability and causality
Criticism
How frequent is Simpson's paradox: an estimate based on the non-informative Dirichlet density
Common cause principle and reformulation of Simpson's paradox
Common cause and screening
A common cause (or screening variable) resolves Simpson's paradox for binary variables
Non-binary causes
Example: smoking and surviving
Example: COVID-19, Italy versus China
Simpson's paradox and common cause principle for Gaussian variables
...and 9 more sections

Figures (1)

Figure 1: Directed acyclic graphs between the random variables $A=(A_1,A_2)$, $B$, and $C$ involved in discussing Simpson's paradox. The first and second graphs were studied in Refs. \cite{pearl,pearl2}; see (\ref{ex1}, \ref{ex2}). Either the third or the fourth graph expresses the basic assumption of this work; see (\ref{gog}). In the first graph, $B$ influences $A_1$ and $A_2$, but $B$ is not a common cause in the strict sense, because there is an influence from $A_2$ to $A_1$. A similar interpretation applies to the second graph. We emphasize that the joint probability $p(A_1,A_2,B)$ has the same form for the first and second graphs, i.e., these graphs are additional constructions used to interpret the data. In contrast, the third and fourth graphs imply a definite limitation (the same for both graphs) on the joint probability $p(A_1,A_2,B,C)$, which is expressed by (\ref{gog}).
