Learning the Cosmic Web: Graph-based Classification of Simulated Galaxies by their Dark Matter Environments

Dakshesh Kololgi, Krishna Naidoo, Amelie Saintonge, Ofer Lahav

TL;DR

The study tackles the challenge of robustly classifying galaxies by their dark matter cosmic web environments. It introduces a three-stage framework that combines Hessian-based T-Web labeling of the density field, a Delaunay graph representation of galaxy positions with ten node features, and a Graph Attention Network (GAT+) to predict the four environments (void, wall, filament, cluster). On IllustrisTNG-300 galaxies with $M_* > 10^9\,M_{\odot}$, the GAT+ model achieves $85\%$ test accuracy, outperforming MLP and GCN baselines, with mutual information highlighting the clustering coefficient as particularly informative. The learned embeddings reveal clearer environment separation than the raw graph metrics, and the results underscore the potential of graph-based approaches to bridge simulations and large observational surveys like DESI through domain adaptation.
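
As a rough illustration of the T-Web labelling step: the T-Web scheme counts how many eigenvalues of the tidal tensor (the Hessian of the gravitational potential) exceed a threshold, and maps counts of 0, 1, 2 and 3 to void, wall, filament and cluster. The sketch below assumes the three eigenvalues have already been evaluated at a galaxy's position. The function name `classify_tweb` and the default threshold are hypothetical placeholders, not values taken from the paper.

```python
# Environment names indexed by the number of eigenvalues above the threshold.
ENVIRONMENTS = ("void", "wall", "filament", "cluster")


def classify_tweb(eigenvalues, threshold=0.0):
    """Label a position from the three tidal-tensor eigenvalues.

    `threshold` is a free parameter of the scheme; the default used here
    is an assumption for illustration only.
    """
    if len(eigenvalues) != 3:
        raise ValueError("expected exactly three eigenvalues")
    # Count the axes along which matter is collapsing (eigenvalue > threshold).
    n_collapsing = sum(1 for lam in eigenvalues if lam > threshold)
    return ENVIRONMENTS[n_collapsing]


if __name__ == "__main__":
    # Toy eigenvalue triplets, not simulation output.
    for triplet in [(-0.3, -0.2, -0.1), (0.5, -0.2, -0.4), (0.9, 0.5, 0.3)]:
        print(triplet, classify_tweb(triplet))
```

Raising the threshold moves positions towards the void end of the sequence, so the class balance depends directly on that one parameter.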

Abstract

We present a novel graph-based machine learning classifier for identifying the dark matter cosmic web environments of galaxies. Large galaxy surveys offer comprehensive statistical views of how galaxy properties are shaped by large-scale structure, but this requires robust classifications of galaxies' cosmic web environments. Using stellar mass-selected IllustrisTNG-300 galaxies, we apply a three-stage, simulation-based framework to link galaxies to the total (mainly dark) underlying matter distribution. First, we assign the positions of simulated galaxies to a void, wall, filament, or cluster environment using the T-Web classification of the underlying matter distribution. Second, we construct a Delaunay triangulation of the galaxy distribution to summarise the local geometric structure with ten graph metrics for each galaxy. Third, we train a graph attention network (GAT) on each galaxy's graph metrics to predict its cosmic web environment. For galaxies with stellar mass $>10^9\,M_{\odot}$, our GAT+ model achieves an accuracy of $85\,\%$, outperforming graph-agnostic multilayer perceptrons and graph convolutional networks. Our results demonstrate that graph-based representations of galaxy positions provide a powerful and physically meaningful way to infer dark matter environments. We plan to apply this simulation-based graph modelling to investigate how the properties of observed galaxies from the Dark Energy Spectroscopic Instrument (DESI) survey are influenced by their dark matter environments.
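
The second step can be sketched as follows, assuming the triangulation is already available as a list of simplices (tuples of galaxy indices, such as the `simplices` attribute returned by `scipy.spatial.Delaunay`). The sketch computes four of the node-level metrics named in the figure captions (degree, mean and minimum edge length, clustering coefficient) using their standard graph-theoretic definitions. The exact definitions used in the paper are not given here, so these are assumptions, and the helper names `build_adjacency` and `node_metrics` are hypothetical.

```python
import math
from itertools import combinations


def build_adjacency(simplices):
    """Map each node to the set of nodes it shares a simplex edge with."""
    adjacency = {}
    for simplex in simplices:
        for i, j in combinations(simplex, 2):
            adjacency.setdefault(i, set()).add(j)
            adjacency.setdefault(j, set()).add(i)
    return adjacency


def node_metrics(points, adjacency):
    """Return per-node degree, edge-length statistics and clustering coefficient."""
    metrics = {}
    for node, neighbours in adjacency.items():
        lengths = [math.dist(points[node], points[n]) for n in neighbours]
        degree = len(neighbours)
        # Clustering coefficient: fraction of neighbour pairs that are
        # themselves connected by an edge.
        possible = degree * (degree - 1) / 2
        linked = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
        metrics[node] = {
            "degree": degree,
            "mean_edge_length": sum(lengths) / degree,
            "min_edge_length": min(lengths),
            "clustering": linked / possible if possible else 0.0,
        }
    return metrics


if __name__ == "__main__":
    # Hand-written toy data: five points and two tetrahedra sharing a face.
    # This is not the output of a real triangulation.
    points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
    simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]
    for node, values in sorted(node_metrics(points, build_adjacency(simplices)).items()):
        print(node, values)
```

The resulting per-galaxy dictionaries correspond to the node feature vectors that a classifier would consume, alongside the edge list implied by the adjacency map.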

Paper Structure

This paper contains 23 sections, 11 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Two-dimensional slice of the Delaunay triangulation constructed from the spatial distribution of IllustrisTNG galaxies. The slice is $1\,\mathrm{Mpc}$ thick in the X-Y plane at $Z = 150\,\mathrm{Mpc}$, and shows the network of edges connecting neighbouring galaxies above a stellar mass cut of $10^{9}\,M_{\odot}$. The Delaunay graph captures the underlying geometric structure of the galaxy distribution, naturally tracing voids, walls, filaments, and clusters within the cosmic web. The $10\,\mathrm{Mpc}$-thick hatched region around the box is the buffer region. Galaxies in this region have unphysical graph metrics due to the spuriously high density of edges resulting from the simulation boundaries.
  • Figure 2: Distributions of mean edge lengths for each IllustrisTNG galaxy (above a $\mathrm{10^9 \,M_{\odot}}$ stellar mass cut). The edges are determined by the Delaunay triangulation graph. The mean edge lengths are scaled using a Box-Cox power transform, as are all the graph metrics, to make them more Gaussian-like for faster training and better numerical stability. This example demonstrates that even a single graph metric can partially discriminate the four cosmic web environments, motivating the use of a combination of graph metrics to capture richer structural information.
  • Figure 3: The mutual information (Ross 2014; Kraskov et al. 2004) quantifies the statistical dependency between each node-level graph metric and the categorical environments (void, wall, filament, cluster) defined by the T-Web classifier. Mutual information captures both non-linear and non-monotonic relationships. The clustering coefficient and neighbour density exhibit the strongest association with cosmic web environments, indicating their higher discriminative power for capturing relevant local structures. At the same time, degree and minimum edge length show relatively weaker dependencies.
  • Figure 4: Overview of the baseline and graph-based neural network architectures explored in this work. The MLP (left) serves as the baseline model, taking node features as independent inputs passed through successive fully-connected layers. The GCN (middle) introduces relational inductive biases by aggregating information from connected nodes in the Delaunay graph, enabling feature propagation along edges. The GAT+ (right) extends this by applying multi-head attention mechanisms and edge features, allowing the network to learn the relative importance of neighbouring nodes. The architectures were refined through iterative experimentation, adjusting the number of layers, hidden dimensions, and normalisation or dropout configurations until convergence performance and stability were optimised across validation runs.
  • Figure 5: Training and validation performance of the GAT+ model. Accuracy (left) and loss (right) curves are shown for the training and validation datasets over 10,000 epochs for the $\mathrm{10^{9}\,M_{\odot}}$ stellar mass cut. The convergence and close overlap between training and validation curves indicate stable optimisation and minimal overfitting.
  • ...and 2 more figures
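
The Box-Cox scaling mentioned in the Figure 2 caption can be illustrated with a minimal sketch of the transform for a fixed exponent. The caption does not say how the exponent is chosen; it is commonly fitted by maximum likelihood (for example with `scipy.stats.boxcox`), so the hand-picked `lam` values below are assumptions, and the function name `box_cox` is hypothetical.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox power transform with a fixed exponent `lam`.

    y = (x**lam - 1) / lam  for lam != 0
    y = log(x)              for lam == 0
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


if __name__ == "__main__":
    # Toy, right-skewed "mean edge lengths" in arbitrary units.
    edge_lengths = [0.5, 1.0, 2.0, 4.0, 16.0]
    print(box_cox(edge_lengths, 0.5))
    print(box_cox(edge_lengths, 0.0))
```

Exponents below one compress the long tail of large values, which is why the transform can make a skewed metric look more Gaussian before it is fed to a network.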