Causal Representation Learning on High-Dimensional Data: Benchmarks, Reproducibility, and Evaluation Metrics

Alireza Sadeghi, Wael AbdAlmageed

Abstract

Causal representation learning (CRL) models aim to transform high-dimensional data into a latent space, enabling interventions to generate counterfactual samples or modify existing data based on the causal relationships among latent variables. To facilitate the development and evaluation of these models, a variety of synthetic and real-world datasets have been proposed, each with distinct advantages and limitations. For practical applications, CRL models must perform robustly across multiple evaluation directions, including reconstruction, disentanglement, causal discovery, and counterfactual reasoning, using appropriate metrics for each direction. However, this multi-directional evaluation can complicate model comparison, as a model may excel in some directions while underperforming in others. Another significant challenge in this field is reproducibility: the source code corresponding to published results must be publicly available, and repeated runs should yield performance consistent with the original reports. In this study, we critically analyze the synthetic and real-world datasets currently employed in the literature, highlight their limitations, and propose a set of essential characteristics for suitable datasets in CRL model development. We also introduce a single aggregate metric that consolidates performance across all evaluation directions, providing a comprehensive score for each model. Finally, we review existing implementations from the literature and assess them in terms of reproducibility, identifying gaps and best practices in the field.

Paper Structure (13 sections, 2 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Fundamental causal junction types that constitute the building blocks of causal graphs.
  • Figure 2: a) Causal graphs for CelebA(SMILE) and CelebA(BEARD), two commonly used subsets of the CelebA dataset for CRL model development. b) Illustration of an unseen variable $X$ acting as a potential confounder, which may explain the dependencies observed between age and gender in CelebA(BEARD), and between gender and smile in CelebA(SMILE). c) Causal graphs for additional CelebA subsets constructed and utilized by huang2025visual, with subset names consistent with those used in their original study.
  • Figure 3: Causal graphs for Pendulum, Flow Noise, Shadow(SunLight), and Shadow(PointLight), four commonly used synthetic datasets for CRL model development.
  • Figure 4: Illustration showing that if a CRL model performs well in all directions except one, its real-world applicability is compromised, analogous to a car with a single flat tire impairing overall system performance.
  • Figure 5: Reconstruction performance of different models. Although CausalVAE outperforms the other two models on the MIC and TIC metrics, its reconstruction ability is inferior, highlighting a critical limitation for practical applications and underscoring the often-overlooked importance of reconstruction in CRL models.
  • ...and 2 more figures