Sanity Checking Causal Representation Learning on a Simple Real-World System

Juan L. Gamella, Simon Bing, Jakob Runge

TL;DR

The paper investigates whether causal representation learning (CRL) methods can recover ground-truth causal factors from real-world observations. To do so, it introduces a simple, controllable light-tunnel system whose known inputs $R,G,B,\theta_1,\theta_2$ serve as the latent factors. It evaluates three representative CRL approaches (Contrastive CRL, Multiview CRL, and CITRIS) and uses a synthetic ablation, in which a deterministic simulator replaces the real data-generating process, to help explain the failure modes. On real data, all methods fail to recover the latent factors; Contrastive CRL (CCRL) performs well on the synthetic ablation but not on the real data. These results underscore the fragility of current CRL methods when applied to real-world data and the crucial role of mixing-function assumptions. The work provides a public benchmark and datasets to drive more robust, reproducible development of CRL methods and closer alignment between identifiability theory and practical performance.

Abstract

We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors (the inputs to the experiment) are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available at github.com/simonbing/CRLSanityCheck

Paper Structure

This paper contains 43 sections, 7 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: The light tunnel (left) and a simplified schematic (right) showing its main components and variables. The tunnel consists of a controllable light source, linear polarizers mounted on rotating frames, a camera, and sensors to measure light intensity at different wavelengths and positions. The inputs to the system ($R,G,B,\theta_1,\theta_2$) are displayed in bold math print in this figure. The system's outputs---image data and numeric sensor measurements---are denoted by a tilde.
  • Figure 2: (A--D): Real images collected from the light tunnel, with the corresponding control inputs overlaid in white. The light source color ($R,G,B$) is the same for images C and D, but the first polarizer is shifted by $90$ degrees in image D, showing the effect of the linear polarizers' angles ($\theta_1, \theta_2$). (E, F): Comparison between (E) a real image (at low resolution) and (F) the synthetic counterpart produced by a simple multi-layer perceptron given the same inputs.
  • Figure 3: Experimental results for the Contrastive CRL method applied to the real data and the data from the synthetic ablation described in \ref{s:testbed}. Left: MCC score of predicting the ground-truth factors, and SHD score between the estimated latent graph and the ground-truth graph (shown on the left). We provide the average and standard deviation ($\pm$) of the scores over five random initializations of the method. Right: Ground-truth graph and a summary of the estimated latent graphs produced by five random initializations of the method. The edges are shaded and labeled according to the frequency they appear in the estimates obtained in the five runs of the method, with darker edges appearing more often. The method performs well on the data from the synthetic ablation, where a deterministic simulator (\ref{s:simulator}) substitutes the data-generating process of the light tunnel. The method fails to produce meaningful results based on the real data.
  • Figure 4: Ground-truth graph relating the underlying causal factors, shown in bold print, to the different views employed in the Multiview CRL experiment. The views are disjoint sets of the output variables produced by the tunnel. The separate factors $(R,G,B)$ are shown as a tuple to avoid drawing additional edges. The graph is a subgraph of the complete causal ground-truth graph for the light tunnel, described in gamella2025chamber.
  • Figure 5: Experimental results for the Multiview CRL method over five random initializations. (A, B, C): Average $R^2$-score of predicting the ground-truth factors from the learned representation across different pairs of views. Success is indicated by simultaneously attaining a score near one for the factors in the content block and a significantly lower score for those in the style block. While content variables are consistently predicted better than style variables, the model fails to disentangle $\theta_1$ and $\theta_2$, as is evident in panels B and C. (D): Scatter plot of the ground-truth factor $\theta_2$ vs. the corresponding sensor measurement in view 4 ($\tilde{\theta}_2$). Even though this view and the ground-truth factor are almost perfectly correlated, the model fails to recover the underlying factor $\theta_2$.
  • ...and 7 more figures
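
The caption of Figure 3 reports two scores, MCC and SHD, without defining them. The sketch below shows one common way such scores are computed; it is not taken from the paper's code. It assumes that MCC is the mean absolute Pearson correlation between ground-truth factors and learned latents under the best one-to-one matching, and that SHD counts each missing, extra, or reversed edge once. It also assumes the nodes of the two graphs are already aligned. The function names `pearson`, `mcc`, and `shd` are hypothetical.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mcc(true_factors, learned_latents):
    """Mean absolute correlation under the best one-to-one matching.

    Both arguments are lists of columns (one list of samples per variable)
    and must contain the same number of variables. Brute-force matching is
    affordable for five factors (120 permutations).
    """
    d = len(true_factors)
    corr = [[abs(pearson(t, z)) for z in learned_latents] for t in true_factors]
    return max(
        sum(corr[i][perm[i]] for i in range(d)) / d
        for perm in permutations(range(d))
    )


def shd(graph_a, graph_b):
    """Structural Hamming distance between two adjacency matrices.

    graph[i][j] == 1 means an edge i -> j. Assumed convention: every node
    pair whose edge status differs (missing, extra, or reversed) counts once.
    """
    n = len(graph_a)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (graph_a[i][j], graph_a[j][i]) != (graph_b[i][j], graph_b[j][i]):
                distance += 1
    return distance
```

Under these assumptions, latents that are a permuted and rescaled copy of the true factors give an MCC of 1, and an SHD of 0 means the estimated graph matches the ground truth exactly. Other conventions exist, for example counting a reversed edge as two errors, so scores are only comparable when the same definition is used.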