Segmenting proto-halos with vision transformers
Toka Alokda, Cristiano Porciani
TL;DR
This study reframes proto-halo identification as a 3D semantic segmentation task on the initial density field and benchmarks two architectures: a CNN-based V-Net (copra) and a vision-transformer encoder with a CNN decoder (vipr). The transformer-based vipr consistently achieves higher accuracy and near-perfect AUC (0.98–0.99) across halo-mass classes, delivering sub-percent total-mass recovery and superior boundary fidelity compared with both copra and the perturbation-theory-based pinocchio model. Training on the density alone, on the scalar tidal shear $T$ alone, or on both shows that the tidal field carries substantial predictive information and that combining fields can yield modest gains. Grad-CAM analyses provide qualitative insights into how the networks leverage input fields, though exact physical interpretation remains nontrivial. These results highlight vision transformers as powerful surrogates for fast, high-accuracy structure-formation modeling, with potential extensions to regression and larger-volume applications.
Abstract
The formation of dark-matter halos from small cosmological perturbations generated in the early universe is a highly non-linear process typically modeled through N-body simulations. In this work, we explore the use of deep learning to segment and classify proto-halo regions in the initial density field according to their final halo mass at redshift $z=0$. We compare two architectures: a fully convolutional neural network (CNN) based on the V-Net design and a U-Net transformer. We find that the transformer-based network significantly outperforms the CNN across all metrics, achieving sub-percent error in the total segmented mass per halo class. Both networks deliver much higher accuracy than the perturbation-theory-based model \textsc{pinocchio}, especially at low halo masses and in the detailed reconstruction of proto-halo boundaries. We also investigate the impact of different input features by training models on the density field, the tidal shear, and their combination. Finally, we use Grad-CAM to generate class-activation heatmaps for the CNN, providing preliminary yet suggestive insights into how the network exploits the input fields.
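To make the "total segmented mass per halo class" metric concrete, the following is a minimal sketch of how such an error could be computed from a predicted and a reference label volume. It is not the authors' implementation: the function name `class_mass_errors`, the integer label encoding, and the toy data are all hypothetical, and the sketch assumes that every voxel of the initial (Lagrangian) grid carries the same mass, so that the mass assigned to a class is proportional to its voxel count.

```python
import numpy as np


def class_mass_errors(pred, truth, n_classes):
    """Relative error in the total segmented mass of each class.

    Assumption (not from the paper): all voxels carry equal mass, so the
    mass of a class is proportional to the number of voxels labelled with it.
    Labels are non-negative integers in [0, n_classes).
    """
    pred_counts = np.bincount(pred.ravel(), minlength=n_classes)
    true_counts = np.bincount(truth.ravel(), minlength=n_classes)

    # Classes absent from the reference volume have no defined relative error.
    errors = np.full(n_classes, np.nan)
    present = true_counts > 0
    errors[present] = (
        pred_counts[present] - true_counts[present]
    ) / true_counts[present]
    return errors


# Toy 16^3 label volume with four hypothetical mass classes.
rng = np.random.default_rng(0)
truth = rng.integers(0, 4, size=(16, 16, 16))

# Corrupt about 5% of the voxels to mimic an imperfect segmentation.
pred = truth.copy()
flip = rng.random(truth.shape) < 0.05
pred[flip] = rng.integers(0, 4, size=int(flip.sum()))

errors = class_mass_errors(pred, truth, n_classes=4)
```

Note that this metric is global: a small per-class mass error does not by itself imply accurate boundaries, since misclassified voxels in opposite directions cancel. This is why voxel-wise metrics and boundary comparisons are reported alongside it.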
