Causal Direction from Convergence Time: Faster Training in the True Causal Direction

Abdulrahman Tamim

TL;DR

Causal Computational Asymmetry (CCA) is a principle for identifying causal direction from optimization dynamics: one neural network is trained to predict Y from X, another to predict X from Y, and the direction that converges faster is inferred to be causal.

Abstract

We introduce Causal Computational Asymmetry (CCA), a principle for causal direction identification based on optimization dynamics in which one neural network is trained to predict $Y$ from $X$ and another to predict $X$ from $Y$, and the direction that converges faster is inferred to be causal. Under the additive noise model $Y = f(X) + \varepsilon$ with $\varepsilon \perp X$ and $f$ nonlinear and injective, we establish a formal asymmetry: in the reverse direction, residuals remain statistically dependent on the input regardless of approximation quality, inducing a strictly higher irreducible loss floor and non-separable gradient noise in the optimization dynamics, so that the reverse model requires strictly more gradient steps in expectation to reach any fixed loss threshold; consequently, the forward (causal) direction converges in fewer expected optimization steps. CCA operates in optimization-time space, distinguishing it from methods such as RESIT, IGCI, and SkewScore that rely on statistical independence or distributional asymmetries, and proper z-scoring of both variables is required for valid comparison of convergence rates. On synthetic benchmarks, CCA achieves 26/30 correct causal identifications across six neural architectures, including 30/30 on sine and exponential data-generating processes. We further embed CCA into a broader framework termed Causal Compression Learning (CCL), which integrates graph structure learning, causal information compression, and policy optimization, with all theoretical guarantees formally proved and empirically validated on synthetic datasets.
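The decision rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names (`zscore`, `steps_to_threshold`, `cca_direction`) are hypothetical, the loss curves are assumed to come from any training loop, and the conventions that a run "converges" at the first step with loss at or below the threshold and that a run which never converges is counted at the step cap are assumptions chosen to match how the figure captions report their numbers.

```python
import math


def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        raise ValueError("cannot z-score a constant variable")
    return [(v - mean) / std for v in values]


def steps_to_threshold(loss_curve, tau, t_max):
    """First 1-indexed step whose loss is <= tau (assumed convention).

    Returns t_max if the threshold is never reached within the cap.
    """
    for step, loss in enumerate(loss_curve[:t_max], start=1):
        if loss <= tau:
            return step
    return t_max


def cca_direction(t_fwd, t_rev):
    """Score = T_fwd - T_rev; a negative score favors X -> Y."""
    score = t_fwd - t_rev
    if score < 0:
        return score, "X -> Y"
    if score > 0:
        return score, "Y -> X"
    return score, "undecided"


# Toy loss curves (hypothetical values, for illustration only).
fwd_losses = [0.9, 0.4, 0.1, 0.04, 0.01]
rev_losses = [0.9, 0.5, 0.3, 0.2, 0.15]
t_fwd = steps_to_threshold(fwd_losses, tau=0.05, t_max=5)
t_rev = steps_to_threshold(rev_losses, tau=0.05, t_max=5)
print(cca_direction(t_fwd, t_rev))  # (-1, 'X -> Y')
```

Both variables would be passed through `zscore` before training either network, since the abstract states that z-scoring is required for the two convergence times to be comparable.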

Paper Structure

This paper contains 47 sections, 16 theorems, 21 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Lemma 4.1

Under the ANM with injective $f$, the optimal reverse regression target is $h^*(Y) = E[X \mid Y]$, and for any finite-capacity approximation $h_\phi \neq h^*$, the residuals $R_{\mathrm{rev}} = X - h_\phi(Y)$ remain statistically dependent on the input $Y$. In contrast, the forward residuals $R_{\mathrm{fwd}} = Y - g_\theta(X)$ satisfy $\mathrm{Cov}(R_{\mathrm{fwd}}, X) \to 0$ as $g_\theta \to f$.
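The contrast between forward and reverse residuals in Lemma 4.1 can be motivated by a short calculation that uses only the ANM definition $Y = f(X) + \varepsilon$, $\varepsilon \perp X$, and the identity $h^*(Y) = E[X \mid Y]$. The following is a sketch for intuition, assuming finite second moments; it is not the paper's proof.

```latex
\begin{align*}
% Forward: substitute Y = f(X) + \varepsilon into the residual.
R_{\mathrm{fwd}} &= Y - g_\theta(X) = \bigl(f(X) - g_\theta(X)\bigr) + \varepsilon, \\
% \varepsilon \perp X removes the noise term from the covariance.
\mathrm{Cov}(R_{\mathrm{fwd}}, X) &= \mathrm{Cov}\bigl(f(X) - g_\theta(X),\, X\bigr), \\
% Cauchy-Schwarz: the remaining term vanishes as g_\theta \to f in L^2.
\bigl|\mathrm{Cov}(R_{\mathrm{fwd}}, X)\bigr|
  &\le \sqrt{E\bigl[(f(X) - g_\theta(X))^2\bigr]\,\mathrm{Var}(X)} \;\longrightarrow\; 0, \\[4pt]
% Reverse: condition the residual on the input Y.
E\bigl[R_{\mathrm{rev}} \mid Y\bigr] &= E[X \mid Y] - h_\phi(Y) = h^*(Y) - h_\phi(Y).
\end{align*}
```

The last line shows that whenever $h^* - h_\phi$ is not almost surely constant, the conditional mean of $R_{\mathrm{rev}}$ varies with $Y$, so $R_{\mathrm{rev}}$ cannot be independent of $Y$. That this mismatch persists for every finite-capacity $h_\phi$ is the content of the lemma and rests on the paper's own argument.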

Figures (5)

  • Figure 1: CCA accuracy per DGP and architecture (Experiment 1). Each group of bars represents one of the six network architectures tested. Bar colors correspond to the five DGPs. The dashed line at 0.5 marks chance. Injective DGPs ($\sin$, $\exp$) achieve perfect 1.0 accuracy across every architecture. The cubic DGP without normalization (blue, 6/30) reveals the scale boundary condition; with z-scoring it recovers to 26/30. Linear Gaussian (gray) and non-injective $X^2$ (red) boundary conditions behave exactly as predicted by theory.
  • Figure 2: Forward vs. reverse convergence on $Y = X^3 + \varepsilon$ (Seed 0, z-scored, MLP-64-64-Tanh/Adam). The forward network (solid blue) crosses the convergence threshold $\tau = 0.05$ at step 161 and continues improving to below $10^{-3}$ MSE. The reverse network (dashed red) descends initially then plateaus just above $\tau$, never crossing it within the 3000-step cap. The CCA score is $161 - 3000 = -2839$, strongly predicting $X \to Y$. This 19-fold gap is substantially larger than the theoretical lower bound, because the reverse landscape also contains saddle points not captured by the PL approximation.
  • Figure 3: CCA score distribution across 30 seeds ($Y = X^3 + \varepsilon$, z-scored). Left: Bar chart of CCA scores ($T_{\mathrm{fwd}} - T_{\mathrm{rev}}$) per seed. Blue bars indicate correct identification (CCA $< 0$); red bars incorrect. 26 of 30 seeds are correct. The four exceptions are seeds where initialization variance caused the forward network to take unusually many steps. Right: Scatter of $T_{\mathrm{fwd}}$ vs. $T_{\mathrm{rev}}$. Points above the dashed diagonal ($T_{\mathrm{fwd}} = T_{\mathrm{rev}}$) are correct identifications. The cluster in the top-left corner represents seeds where the reverse hit the 3000-step cap while forward converged early, the strongest possible CCA signal. Mean $T_{\mathrm{fwd}} = 323 \pm 531$ steps; mean $T_{\mathrm{rev}} = 717 \pm 789$ steps; reverse takes $2.2\times$ longer on average.
  • Figure 4: Boundary condition experiments, 30 seeds each. Left: Linear Gaussian ($Y = 2X + \varepsilon$). CCA $< 0$ in only 1 of 30 seeds, indistinguishable from random. Scores are near zero (note the $y$-axis scale is $[-1, +1]$): forward and reverse take nearly the same number of steps because Gaussian symmetry makes the two optimization problems identical. This is the failure predicted by theory. Right: Non-injective ($Y = X^2 + \varepsilon$). CCA $< 0$ in 29 of 30 seeds with scores down to $-3000$. This is the degenerate collapse: the reverse network learns to predict zero (because $E[X \mid Y] = 0$ by symmetry of $P(X)$) in under 25 steps, while the forward network needs hundreds of steps to learn $x \mapsto x^2$. This is not CCA working; it is a structural identifiability failure that should not be used to benchmark the method.
  • Figure 5: Tübingen Cause-Effect Pairs benchmark ($T_{\max} = 10{,}000$, z-scored, 108 pairs). Left: CCA score distribution. Blue bars are correct predictions, red bars incorrect. The vast majority of pairs score near zero (mass concentrated at the decision boundary), which corresponds to low-confidence predictions. Incorrect predictions are sparse and concentrated near zero, consistent with the boundary conditions: pairs with near-linear mechanisms or near-symmetric marginals produce weak asymmetry signals. Right: Cumulative accuracy sorted by $|\mathrm{CCA}|$ (most confident pairs first). CCA overall accuracy of 0.96 (dashed gray) substantially exceeds ANM/RESIT at 0.63 (red dotted) and chance at 0.50 (black dotted). The volatile accuracy at low confidence (leftmost part of the curve, fewest pairs evaluated) stabilizes clearly above both baselines as more pairs are included.
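
Two quantities in the captions above are simple to reproduce: the per-run CCA score and fold gap quoted for Figure 2, and a confidence-sorted cumulative accuracy curve of the kind shown in Figure 5 (right). The sketch below is illustrative only. The function names are hypothetical, and the toy scores are made up for the example; they are not the Tübingen results.

```python
def cca_score(t_fwd, t_rev):
    """CCA score as defined in the Figure 3 caption: T_fwd - T_rev."""
    return t_fwd - t_rev


def cumulative_accuracy(scores, correct):
    """Accuracy over the k most confident pairs, for k = 1..n.

    scores  : CCA scores, one per pair; confidence is |score|
    correct : booleans, True where the predicted direction matched the label
    """
    # Most confident (largest |score|) first.
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += int(correct[i])
        curve.append(hits / k)
    return curve


# Figure 2 numbers: forward crosses tau at step 161, reverse is capped at 3000.
score = cca_score(161, 3000)   # -2839
fold_gap = 3000 / 161          # about 18.6, reported as "19-fold"

# Hypothetical toy pairs (NOT the benchmark data).
toy_scores = [-2839, 12, -3, 400]
toy_correct = [True, False, True, True]
curve = cumulative_accuracy(toy_scores, toy_correct)

print(score, round(fold_gap, 1), [round(a, 2) for a in curve])
# -2839 18.6 [1.0, 1.0, 0.67, 0.75]
```

The final entry of the curve equals the overall accuracy, which is why the curve in Figure 5 settles at the reported aggregate value as more pairs are included.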

Theorems & Definitions (37)

  • Definition 3.1: CCL Objective
  • Lemma 4.1: Residual Dependence
  • proof
  • Lemma 4.2: Landscape Complexity
  • proof
  • Lemma 4.3: Harder Landscape, More Steps
  • proof
  • Theorem 4.4: CCA Asymmetry
  • proof
  • Remark 4.5: Scope of the Theorem
  • ...and 27 more