Algorithmic causal structure emerging through compression

Liang Wendong, Simon Buchholz, Bernhard Schölkopf

TL;DR

The paper investigates how causal and symmetric structures can emerge from data compression when data originate from multiple environments and intervention targets are unknown. It introduces algorithmic causality, modeling causal mechanisms as conditional feature-mechanism programs (CFMPs) implemented by Turing machines, and uses UFCC-based finite codebook bounds to select causal directions by minimizing total code length. Through theoretical results and case studies on causal factorizations and symmetries, it demonstrates that compression-driven model selection can reveal causal structure even without identifiability. The empirical and theoretical findings suggest that large-scale models, such as language models, may exhibit emergent algorithmic causality as a by-product of data compression and shared mechanisms. This framework offers a complementary lens to Pearlian causality, focusing on regularities captured by algorithmic minimality rather than interventions alone.

Abstract

We explore the relationship between causality, symmetry, and compression. We build on and generalize the known connection between learning and compression to a setting where causal models are not identifiable. We propose a framework where causality emerges as a consequence of compressing data across multiple environments. We define algorithmic causality as an alternative definition of causality when traditional assumptions for causal identifiability do not hold. We demonstrate how algorithmic causal and symmetric structures can emerge from minimizing upper bounds on Kolmogorov complexity, without knowledge of intervention targets. We hypothesize that these insights may also provide a novel perspective on the emergence of causality in machine learning models, such as large language models, where causal relationships may not be explicitly identifiable.

Paper Structure

This paper contains 18 sections, 7 theorems, 11 equations, 6 figures, 1 table.

Key Result

Lemma 1

(Identifiability implies uniqueness of solution of minimum cross-entropy) For readability we stay in the unconfounded setting and the strong version of identifiability. We can readily generalize def:identCDCRL and this lemma to identifiability up to an equivalence class, or generalize to the setting is unique.

Figures (6)

  • Figure 1: Illustration of a CFMP (\ref{def:cond_feat_mechanism_program}). A CFMP $\alpha$ is a Turing machine that, given any input in $\mathcal{X}^d$, proceeds sequentially through three steps (shown in red). Probabilistic mechanisms are blue and feature mechanisms are green. $\epsilon$ denotes the empty string. Before reading the input tape, $\alpha$ performs two steps: it generates $\mathcal{P}_\alpha, \Phi_\alpha$, and then featurizes the probabilistic mechanisms. In the third step, $\alpha$ multiplies the conditional probabilities it needs to calculate $\mathbb{P}(x)$ and marginalizes over latent variables if there are hidden-variable mechanisms. We emphasize that this figure, like the CFMP itself, does not depict a learning process; it depicts a single model in the model class over which we perform model selection.
  • Figure 2: Given $(m,n,d)$, we consider the codebooks on $\mathcal{X}^d:=(\mathcal{B}^m)^d$ with precision $(m,n)$. The curves are fictitious for illustration, since some of them are not computable. Left figure: The x-axis is the index of Turing machines in an effective enumeration of all Turing machines that can compute a codebook. The y-axis is the coding length using a universal Turing machine or a UFCC. Right figure: The x-axis is the index of codebooks in an effective enumeration of all codebooks. The y-axis is the minimum coding length of the Turing machine or FCM that the universal Turing machine or UFCC can simulate. Blue line: Turing machines simulated by an arbitrary universal Turing machine. Green line: FCMs simulated by a universal two-layer neural network computer. Yellow line: FCMs simulated by $U_{\text{TabCBN}}$ defined in \ref{prop:factorize_shorter}. Red line: FCMs simulated by $U_{\text{unif}}$ defined above.
  • Figure 3: Illustration of \ref{prop:SMS}. Using $U_{\text{compCBN}}$, the coding length of strategy 1 minus that of strategy 2 is initially positive, decreases with $k$, and then becomes negative. Under $U_{\text{compCBN}}$, the objective \ref{eq:two-part-objective} minimizes the sum of the Shannon code length and the length of the CFMP. Strategy 2 is preferred when a few featurized mechanisms are precise enough for the CFMP to model the distribution of the multi-env system.
  • Figure 4: Results in \ref{sec:exp_cov_shifts}. The left figure shows the minimal negative log-likelihood (NLL) of the CFMPs that use $k$ mechanisms $\mathbb{P}(X|E)$. The right figure shows the minimal FC complexity (NLL + model coding length $2l_{U_{\text{compCBN}}}(\alpha)+1$ (\ref{eq:two-part-objective})) of the CFMPs that use $k$ mechanisms $\mathbb{P}(X|E)$. We choose 3 different multi-env distributions to generate the data, using 2, 4, and 7 mechanisms, respectively, among 10 environments. The experiments are run with 5 seeds. The argmin $k$ values are highlighted.
  • Figure 5: For the experiment with ground truth $k=7$ for 10 envs, we increase the number of samples per env. As the number increases, the model selected by minimizing FC complexity tends to use more mechanisms in different environments.
  • ...and 1 more figure
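
The two-part objective described in the Figure 4 caption (NLL plus a model coding length of the form $2l(\alpha)+1$) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's $U_{\text{compCBN}}$ encoding: each mechanism is a single Bernoulli parameter quantized to `n_bits`, the hypothetical `model_bits` function charges `n_bits` per mechanism plus one mechanism index per environment, and the search over groupings is a simple heuristic over environments sorted by empirical frequency.

```python
import math
from itertools import combinations


def nll_bits(heads, total, p):
    """Shannon code length in bits of `heads` ones among `total` binary samples under Bernoulli(p)."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # avoid log(0)
    return -(heads * math.log2(p) + (total - heads) * math.log2(1 - p))


def quantize(p, n_bits):
    """Snap p to the midpoint of one of 2**n_bits equal cells, so it is describable in n_bits."""
    cells = 2 ** n_bits
    idx = min(int(p * cells), cells - 1)
    return (idx + 0.5) / cells


def model_bits(k, n_envs, n_bits):
    """ASSUMED model code length: k quantized parameters plus one mechanism index per environment."""
    index_bits = math.ceil(math.log2(k)) if k > 1 else 0
    return k * n_bits + n_envs * index_bits


def best_cost_for_k(counts, k, n_bits):
    """Lowest two-part cost over groupings of the sorted environments into k contiguous groups."""
    envs = sorted(counts, key=lambda c: c[0] / c[1])
    n = len(envs)
    best_nll = math.inf
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        nll = 0.0
        for lo, hi in zip(bounds, bounds[1:]):
            heads = sum(c[0] for c in envs[lo:hi])
            total = sum(c[1] for c in envs[lo:hi])
            nll += nll_bits(heads, total, quantize(heads / total, n_bits))
        best_nll = min(best_nll, nll)
    # Two-part cost: data code length + (2 * model code length + 1), as in the caption.
    return best_nll + 2 * model_bits(k, n, n_bits) + 1


def select_k(counts, n_bits=8):
    """Return the number of shared mechanisms k that minimizes the two-part cost, plus all costs."""
    costs = {k: best_cost_for_k(counts, k, n_bits) for k in range(1, len(counts) + 1)}
    return min(costs, key=costs.get), costs


# Four environments given as (ones, samples); two pairs look like they share a mechanism.
counts = [(100, 1000), (110, 1000), (900, 1000), (890, 1000)]
best_k, costs = select_k(counts)
print(best_k, {k: round(v, 1) for k, v in costs.items()})
```

In this toy setting, adding a mechanism lowers the NLL term but raises the model term, so the minimizer trades fit against description length. This is the same qualitative trade-off the caption attributes to FC complexity, shown here under a much simpler, assumed encoding.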

Theorems & Definitions (28)

  • Definition 1: Identifiability in causal discovery
  • Lemma 1
  • Definition 2: informal, algorithmic causality
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Example 1: multi-env CBN
  • ...and 18 more