Segmenting proto-halos with vision transformers

Toka Alokda, Cristiano Porciani

TL;DR

This study reframes proto-halo identification as a 3D semantic segmentation task on the initial density field and benchmarks two architectures: a CNN-based V-Net (copra) and a vision-transformer encoder with a CNN decoder (vipr). The transformer-based vipr consistently achieves higher accuracy and near-perfect AUC (0.98–0.99) across halo-mass classes, delivering sub-percent total-mass recovery and superior boundary fidelity compared with both copra and the perturbation-theory-based pinocchio model. Training on the density field alone, on the scalar tidal shear $T$ alone, or on both shows that the tidal field carries substantial predictive information and that combining fields can yield modest gains. Grad-CAM analyses provide qualitative insights into how the networks leverage input fields, though exact physical interpretation remains nontrivial. These results highlight vision transformers as powerful surrogates for fast, high-accuracy structure-formation modeling, with potential extensions to regression and larger-volume applications.

Abstract

The formation of dark-matter halos from small cosmological perturbations generated in the early universe is a highly non-linear process typically modeled through N-body simulations. In this work, we explore the use of deep learning to segment and classify proto-halo regions in the initial density field according to their final halo mass at redshift $z=0$. We compare two architectures: a fully convolutional neural network (CNN) based on the V-Net design and a U-Net transformer. We find that the transformer-based network significantly outperforms the CNN across all metrics, achieving sub-percent error in the total segmented mass per halo class. Both networks deliver much higher accuracy than the perturbation-theory-based model pinocchio, especially at low halo masses and in the detailed reconstruction of proto-halo boundaries. We also investigate the impact of different input features by training models on the density field, the tidal shear, and their combination. Finally, we use Grad-CAM to generate class-activation heatmaps for the CNN, providing preliminary yet suggestive insights into how the network exploits the input fields.
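As an illustration of the "error in the total segmented mass per halo class" quoted above, the following minimal sketch computes a per-class relative mass error from two integer label volumes. It assumes that every voxel of the initial grid carries the same mass, so that the mass assigned to a class is proportional to its voxel count; the function name `class_mass_error` and this equal-mass assumption are illustrative and are not taken from the paper.

```python
import numpy as np

def class_mass_error(true_labels, pred_labels, n_classes):
    """Relative error in the total segmented mass of each class.

    Assumption (not from the paper): all voxels have equal mass, so the
    mass of a class is proportional to the number of voxels labelled with it.
    Classes absent from the true labels get NaN.
    """
    true_counts = np.bincount(true_labels.ravel(), minlength=n_classes).astype(float)
    pred_counts = np.bincount(pred_labels.ravel(), minlength=n_classes).astype(float)
    # Replace empty classes by NaN so the division below never hits zero.
    safe_true = np.where(true_counts > 0, true_counts, np.nan)
    return (pred_counts - true_counts) / safe_true

# Toy 2x2x2 volume with three classes (0 = non-halo, 1 and 2 = mass bins).
true_demo = np.array([0, 0, 1, 1, 1, 1, 2, 2]).reshape(2, 2, 2)
pred_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2]).reshape(2, 2, 2)
demo_errors = class_mass_error(true_demo, pred_demo, n_classes=3)
```

In this toy case one voxel of class 1 is mislabelled as class 2, so class 1 loses a quarter of its mass and class 2 gains half of its mass, while the non-halo class is recovered exactly. A sub-percent result in this metric would correspond to absolute values below 0.01.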

Paper Structure

This paper contains 24 sections, 14 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Left: A dark-matter halo of mass $M=6.46\times 10^{12}\,h^{-1}$ M$_\odot$ at $z=0$, shown along with its immediate surroundings projected over a depth of $0.58\,h^{-1}\,\text{Mpc}$, corresponding approximately to the size of the structure. The red circle marks the outer boundary of the halo, defined as the radius enclosing a mean over-density of $\Delta=200$. The color of the N-body particles represents the local mass density, increasing logarithmically from blue to yellow, as estimated using an adaptive Gaussian kernel that contains 32 neighbors within its core. Right: The corresponding proto-halo patch at $z=99$ (shaded region with thick red boundary), overlaid on the projected linear over-density field with a depth of $9.37\,h^{-1}\,\text{Mpc}$, chosen to encompass the full Lagrangian extent of the proto-halo. Bright yellow and dark blue indicate the highest and lowest density regions, respectively. Note the difference in scales between the two panels: the material that initially spans several $h^{-1}$ comoving Mpc collapses into a compact structure of approximately $0.5\,h^{-1}$ Mpc in diameter.
  • Figure 2: Architecture diagram of the V-Net model copra used in this study. The network is illustrated for the case of $P=5$ output classes, corresponding to $M=2P=10$ base filters. The encoder–decoder structure includes convolutional and transposed-convolutional layers arranged in multi-layer blocks with skip connections, enabling effective learning of both global context and fine-grained spatial detail.
  • Figure 3: Schematic of the ViT-based encoder used in vipr. The example illustrates the configuration for 5 output classes (corresponding to 4 mass bins plus a non-halo class). The input 3D volume is divided into patches, which are flattened, projected into a $K$-dimensional embedding space, and enriched with positional encodings. These embeddings are then processed through a series of transformer blocks. The resulting output is passed to a CNN-based decoder, which reconstructs voxel-wise classification probabilities.
  • Figure 4: Full UNETR architecture used in vipr. The base number of filters is set to $M=2P$, where $P$ is the number of output classes. The ViT encoder uses $M$ attention heads and an embedding size of $K=512$. In the CNN-based decoder, each convolutional block includes two instance normalization layers. Skip connections from selected transformer layers are integrated into the decoder to retain spatial context and support accurate volumetric segmentation.
  • Figure 5: Receiver Operating Characteristic (ROC) curves for the best-performing CNN-based model (copra, left) and ViT-based model (vipr, right), each trained with 7 output classes. The curves, computed on the validation set, show the binary classification performance for all classes; due to the high accuracy, the curves for different classes nearly overlap. Symbols indicate selected operating points for different classification thresholds, shown only for the non-halo (background) class. In the left panel, the symbols nearly coincide due to the sharpness of copra's score distribution.
  • ...and 9 more figures
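
The patch-embedding step described in the caption of Figure 3 (divide the 3D volume into patches, flatten them, project them into a $K$-dimensional space, add positional encodings) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the vipr implementation: the function names `patchify_3d` and `embed_patches`, the cubic non-overlapping patches, the additive positional encoding, and the toy sizes are all hypothetical, and in a trained network the projection weights and positional encodings would be learned.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a cubic volume of side n into non-overlapping patch^3 blocks.

    Returns an array of shape (n_patches, patch**3), one flattened patch per row.
    """
    n = volume.shape[0]
    if volume.shape != (n, n, n) or n % patch != 0:
        raise ValueError("volume must be cubic with a side divisible by patch")
    g = n // patch  # number of patches along each axis
    blocks = volume.reshape(g, patch, g, patch, g, patch)
    # Bring the three patch-grid axes to the front, the within-patch axes last.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(g**3, patch**3)

def embed_patches(volume, patch, weights, pos):
    """Linear projection of flattened patches plus additive positional encoding.

    weights: (patch**3, K) projection matrix (hypothetical, learned in practice)
    pos:     (n_patches, K) positional encodings (hypothetical, learned in practice)
    """
    tokens = patchify_3d(volume, patch)
    return tokens @ weights + pos

# Toy example: a 4^3 volume, 2^3 patches, embedding size K = 16.
rng = np.random.default_rng(0)
demo_volume = np.arange(64, dtype=float).reshape(4, 4, 4)
demo_tokens = patchify_3d(demo_volume, patch=2)
demo_embed = embed_patches(demo_volume, 2,
                           rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

Each row of `demo_embed` is one token, which is the form of input a sequence of transformer blocks would consume. The caption of Figure 4 quotes an embedding size of $K=512$; the value 16 here only keeps the toy example small.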