The causal structure of galactic astrophysics

Harry Desmond, Joseph Ramsey

TL;DR

The paper addresses the limitation of correlation-only analyses in astrophysics by applying causal discovery to a large sample of low-redshift galaxies. It develops the FCIT algorithm to infer causal structure, calibrates it on mock data generated by a Causal Perceptron Network, and applies it to galaxies from the NASA Sloan Atlas (NSA), revealing causal links such as mass driving size and morphology, and star formation shaping luminosity. On the mocks the algorithm reaches ~90% edge-recovery accuracy, and on the real data it identifies a physically interpretable causal backbone, while highlighting latent confounders and observational biases that complicate interpretation. The approach offers a principled, direction-aware framework for constraining galaxy evolution theories and motivates further methodological and data enhancements.

Abstract

Data-driven astrophysics currently relies on the detection and characterisation of correlations between objects' properties, which are then used to test physical theories that make predictions for them. This process fails to utilise information in the data that forms a crucial part of the theories' predictions, namely which variables are directly correlated (as opposed to accidentally correlated through others), the directions of these determinations, and the presence or absence of confounders that correlate variables in the dataset but are themselves absent from it. We propose to recover this information through causal discovery, a well-developed methodology for inferring the causal structure of datasets that is, however, almost entirely unknown to astrophysics. We develop a causal discovery algorithm suitable for large astrophysical datasets and illustrate it on $\sim$5$\times10^5$ low-redshift galaxies from the NASA Sloan Atlas, demonstrating its ability to distinguish physical mechanisms that are degenerate on the basis of correlations alone.

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures.

Figures (3)

  • Figure 1: Distributions and pairwise correlations of the NSA data used as input to the causal discovery algorithm. The contour levels contain 39.3, 86.5 and 98.9 per cent of the points (1, 2 and 3$\sigma$). The complex correlations necessitate a nonlinear correlation metric for assessing conditional independence.
  • Figure 2: The precision, recall and F1 statistics across 200 NSA-like mock datasets as a function of the penalty_discount, at truncation_limit$=14$. Solid lines show the mean over the datasets, and bands the 16$^\text{th}$ to 84$^\text{th}$ percentile range. A maximum reliability of $\sim$90 per cent is achieved at penalty_discount$\approx$50.
  • Figure 3: The PAG of the NSA data. Each node contains a colloquial parameter name in bold as well as the technical variable name in the NSA. Confident causal structures are indicated by directed edges, while less confident associations (circle endpoints) may be impacted by latent confounders.
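
The precision, recall and F1 statistics quoted in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that a recovered edge counts as correct only if the same directed pair appears in the true graph; the paper's exact matching rule (for example, how circle endpoints are scored) is not stated here, so this convention, the function name `edge_scores` and the toy edge lists are all hypothetical.

```python
def edge_scores(true_edges, found_edges):
    """Precision, recall and F1 of a recovered edge set against a known truth.

    Assumption: edges are (cause, effect) tuples and must match exactly.
    """
    true_set, found_set = set(true_edges), set(found_edges)
    n_correct = len(true_set & found_set)

    # Precision: fraction of recovered edges that are in the true graph.
    precision = n_correct / len(found_set) if found_set else 0.0
    # Recall: fraction of true edges that were recovered.
    recall = n_correct / len(true_set) if true_set else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy graphs, not taken from the paper's mocks or results.
true_edges = [
    ("mass", "size"),
    ("mass", "morphology"),
    ("mass", "sfr"),
    ("sfr", "luminosity"),
]
found_edges = [
    ("mass", "size"),
    ("mass", "morphology"),
    ("sfr", "luminosity"),
    ("size", "luminosity"),  # spurious edge
]

precision, recall, f1 = edge_scores(true_edges, found_edges)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case three of the four recovered edges are true and three of the four true edges are recovered, so all three statistics equal 0.75. Averaging such scores over many mock datasets, as a function of a tuning parameter such as penalty_discount, would give curves of the kind the caption describes.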