Vision Transformers for Cosmological Fields: Application to Weak Lensing Mass Maps

Jash Kakadia, Shubh Agrawal, Kunhao Zhong, Bhuvnesh Jain

TL;DR

This work assesses whether attention-based vision models can extract non-Gaussian information from weak-lensing mass maps to constrain $Ω_m$ and $S_8$ using simulation-based inference (SBI). It compares Vision Transformers (ViT) and Swin Transformers against CNN baselines on convergence maps from DarkGridV1, incorporating tomographic channels and pre-training on synthetic data. The Swin Transformer generally outperforms vanilla ViT, particularly with limited training data, yet the cosmological Figure of Merit under realistic shape noise remains comparable to CNNs, with pre-training substantially boosting transformer performance. The results suggest transformers offer interpretability advantages and potential gains with more data or improved pre-training, but do not yet surpass CNNs in this realistic setting for cosmological parameter inference.

Abstract

Weak gravitational lensing is a powerful probe of the universe's growth history. While traditional two-point statistics capture only the Gaussian features of the convergence field, deep learning methods such as convolutional neural networks (CNNs) have shown promise in extracting non-Gaussian information from small-scale, nonlinear structures. In this work, we evaluate the effectiveness of attention-based architectures, including variants of vision transformers (ViTs) and shifted window (Swin) transformers, in constraining the cosmological parameters $Ω_m$ and $S_8$ from weak lensing mass maps. Using a simulation-based inference (SBI) framework, we compare transformer-based methods to CNNs. We also examine performance scaling with the number of available $N$-body simulations, highlighting the importance of pre-training for transformer architectures. We find that the Swin transformer performs significantly better than vanilla ViTs, especially with limited training data. Despite their higher representational capacity, the Figure of Merit for cosmology achieved by transformers is comparable to that of CNNs under realistic noise conditions.

Paper Structure

This paper contains 6 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Vision-model predictions of the parameter $S_8=\sigma_8\sqrt{\Omega_\mathrm{m}/0.3}$. The first two plots show the predicted versus true values, and the last two plots show histograms of the residuals. The statistical measures reported are the root-mean-square error (RMSE) and $R^2$, summarized in Appendix \ref{appendix:model_exploration}. Note that NoNoise is an exploratory case with only one channel, whereas LSST-Y1 Like uses maps with 4 channels, as detailed in Section \ref{sec:methodology}.
  • Figure 2: Left: SBI posteriors obtained from our NDE setup, detailed in Section \ref{sec:methodology}, for a CNN (blue) and a Swin Transformer (red) on cosmological weak lensing $\kappa$ fields. Right: Results from TARP (Lemos et al. 2023), which compares empirical coverage against credibility level to gauge posterior performance. Posteriors obtained on the test set from our NDE setup yield curves closely aligned with the identity line, indicating that neither posterior is under- or over-constrained.
  • Figure 3: RMSE and $R^2$ for $S_8$ as a function of the training data fraction. Pre-training significantly improves the performance of attention-based models, especially in low-data regimes. In contrast, CNN baselines (shown without pre-training) exhibit stronger regularization and more stable performance.
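
The figure captions rely on three quantities: the derived parameter $S_8=\sigma_8\sqrt{\Omega_\mathrm{m}/0.3}$, the root-mean-square error, and $R^2$. The following is a minimal sketch of how these could be computed, not the authors' evaluation code. It assumes the standard coefficient-of-determination definition of $R^2$, which the captions do not spell out. The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def s8(sigma8, omega_m):
    """S_8 = sigma_8 * sqrt(Omega_m / 0.3), as defined in the Figure 1 caption."""
    return sigma8 * math.sqrt(omega_m / 0.3)


def rmse(truth, pred):
    """Root-mean-square error between true and predicted values."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(truth, pred)) / len(truth))


def r_squared(truth, pred):
    """Coefficient of determination (assumed definition): 1 - SS_res / SS_tot."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


# Hypothetical (sigma_8, Omega_m) pairs standing in for simulation labels.
truth = [s8(0.80, 0.30), s8(0.75, 0.35), s8(0.85, 0.25)]
# Hypothetical model predictions: the truth plus small made-up offsets.
pred = [t + d for t, d in zip(truth, (0.01, -0.02, 0.01))]
# Residuals of the kind histogrammed in the last two panels of Figure 1.
residuals = [p - t for t, p in zip(truth, pred)]

print("RMSE:", rmse(truth, pred))
print("R^2:", r_squared(truth, pred))
```

With this definition, a model that always predicts the mean of the true values scores $R^2 = 0$, and a perfect model scores $R^2 = 1$.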