Causal vs. Anticausal merging of predictors

Sergio Hernan Garrido Mejia; Patrick Blöbaum; Bernhard Schölkopf; Dominik Janzing

TL;DR

The paper studies how causal versus anticausal assumptions affect the merging of predictors under the Causal Maximum Entropy (CMAXENT) framework, in a simple setup with a binary target $Y$ and two continuous predictors $X$. When all first and second moments are observed, the causal direction yields a logistic regression predictor for $p(Y|X)$, while the anticausal direction yields Linear Discriminant Analysis, which links CMAXENT to these classical classifiers. The paper then considers partial knowledge of the moments and the resulting Out-Of-Variable generalisation: it derives how the decision boundaries shift under incomplete information and establishes when their slopes remain equal. The results expose intrinsic asymmetries between the causal directions in predictor merging, with implications for transfer learning, domain adaptation, and federated or mixture-of-experts settings, where causal structure informs how heterogeneous predictors should be combined.

Abstract

We study the differences arising from merging predictors in the causal and anticausal directions using the same data. In particular, we study the asymmetries that arise in a simple model where we merge the predictors using one binary variable as target and two continuous variables as predictors. We use Causal Maximum Entropy (CMAXENT) as the inductive bias to merge the predictors; however, we expect similar differences to hold when using other merging methods that take into account asymmetries between cause and effect. We show that if we observe all bivariate distributions, the CMAXENT solution reduces to logistic regression in the causal direction and Linear Discriminant Analysis (LDA) in the anticausal direction. Furthermore, we study how the decision boundaries of these two solutions differ whenever we observe only some of the bivariate distributions, and the implications for Out-Of-Variable (OOV) generalisation.
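To make the anticausal (LDA) side of this statement concrete, the following minimal sketch applies Bayes' rule to two Gaussian class-conditionals with a shared covariance and checks that the resulting posterior is a sigmoid of a linear function of the predictors, so the decision boundary is a straight line. This is the standard textbook form of LDA, not necessarily the paper's parametrisation; all numerical values (`MU0`, `MU1`, `SIGMA`, `PI1`) and all function names are illustrative assumptions.

```python
import math

def inv2(S):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matvec(M, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return (M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1])

def dot(u, v):
    """Inner product of two 2-vectors."""
    return u[0] * v[0] + u[1] * v[1]

def gauss_pdf(x, mu, S):
    """Density of a bivariate Gaussian with mean mu and covariance S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    diff = (x[0] - mu[0], x[1] - mu[1])
    quad = dot(diff, matvec(inv2(S), diff))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def posterior_bayes(x, mu0, mu1, S, pi1):
    """P(Y=1 | x) computed directly from Bayes' rule."""
    p1 = pi1 * gauss_pdf(x, mu1, S)
    p0 = (1.0 - pi1) * gauss_pdf(x, mu0, S)
    return p1 / (p0 + p1)

def lda_params(mu0, mu1, S, pi1):
    """Weights w and intercept b of the LDA log-odds w.x + b."""
    S_inv = inv2(S)
    w = matvec(S_inv, (mu1[0] - mu0[0], mu1[1] - mu0[1]))
    b = (-0.5 * (dot(mu1, matvec(S_inv, mu1)) - dot(mu0, matvec(S_inv, mu0)))
         + math.log(pi1 / (1.0 - pi1)))
    return w, b

def posterior_linear(x, w, b):
    """P(Y=1 | x) as a sigmoid of the linear log-odds."""
    return 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))

# Hypothetical moments, chosen only for illustration.
MU0 = (0.0, 0.0)                  # mean of X given Y = 0
MU1 = (2.0, 1.0)                  # mean of X given Y = 1
SIGMA = ((1.0, 0.3), (0.3, 2.0))  # covariance shared by both classes
PI1 = 0.4                         # P(Y = 1)

W, B = lda_params(MU0, MU1, SIGMA, PI1)
BOUNDARY_SLOPE = -W[0] / W[1]     # slope of the line w.x + b = 0 in the (x1, x2) plane

if __name__ == "__main__":
    for point in [(0.0, 0.0), (1.0, -1.0), (2.5, 0.5)]:
        print(point,
              posterior_bayes(point, MU0, MU1, SIGMA, PI1),
              posterior_linear(point, W, B))
```

Both columns printed for each point should agree up to floating-point error. The point of the comparison is that an LDA posterior and a logistic regression share the same functional form and differ only in how the weights are obtained from the observed moments.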

Paper Structure (24 sections, 13 theorems, 44 equations, 3 figures)

Introduction
Notation and preliminaries
Notation
Maximum Entropy and Causal Maximum Entropy
Known predictor covariances
The causal direction
The anticausal direction
The predictor of $Y$ in the anticausal direction
The geometry of the decision boundaries
What are the differences?
Partially known covariances
Unknown predictor-target covariance
Unknown predictor covariance
Discussion
Relation between the expectations of the Mixture of Gaussians and the known marginal expectations
...and 9 more sections

Key Result

Proposition 0

Using the Lagrange multiplier formalism for the marginal and conditional optimisation problems in the causal direction, we obtain: (i) a multivariate Gaussian distribution for $P(\mathbf{X})$, and (ii) a density of $Y$ conditioned on $\mathbf{X}$ in exponential form, where $\alpha(\mathbf{x})$ is a normalising constant. This density can be written as a logistic regression.
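The step from an exponential-form conditional to a logistic regression can be sketched as follows. The encoding $Y \in \{0, 1\}$ and the multiplier names $\lambda_0$ and $\boldsymbol{\lambda}$ are assumptions made for this illustration; the paper's own notation and constraints may differ.

```latex
\begin{align}
p(y \mid \mathbf{x})
  &= \exp\bigl(\lambda_0\, y + \boldsymbol{\lambda}^\top \mathbf{x}\, y + \alpha(\mathbf{x})\bigr),
  \qquad y \in \{0, 1\}, \\
\alpha(\mathbf{x})
  &= -\log\bigl(1 + \exp(\lambda_0 + \boldsymbol{\lambda}^\top \mathbf{x})\bigr), \\
p(Y = 1 \mid \mathbf{x})
  &= \frac{1}{1 + \exp\bigl(-(\lambda_0 + \boldsymbol{\lambda}^\top \mathbf{x})\bigr)}.
\end{align}
```

The second line follows from requiring $p(0 \mid \mathbf{x}) + p(1 \mid \mathbf{x}) = 1$, and the third from substituting $\alpha(\mathbf{x})$ back into the first with $y = 1$. Under a $\{-1, 1\}$ encoding the same argument goes through with the exponent doubled.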

Figures (3)

Figure 1: Causal graphs analysed throughout the article
Figure 2: Decision boundaries of the solution of CMAXENT in the causal (left) and anticausal (right) direction when we do not have the covariance between the predictor variables $\bar{s}_{1,2}$.
Figure 3: Graph in the causal and anticausal direction

Theorems & Definitions (21)

Proposition 0: Resulting predictor in the causal direction
Remark 1
Proposition 1: Resulting predictor in the anticausal direction
Remark 2
Theorem 3: Predictor of $Y$ using Bayes' rule
Corollary 4: Quadratic Discriminant Analysis (QDA)
Corollary 5: Exponential family discriminant analysis
Remark 6
Proposition 6: Normal vector to the decision boundaries in causal and anticausal direction
Theorem 7: Slope of the decision boundary is the same in causal and anticausal direction
...and 11 more
