CosmoGLINT: Cosmological Generative Model for Line Intensity Mapping with Transformer

Kana Moriwaki, Rui Lan Jun, Ken Osato, Naoki Yoshida

TL;DR

CosmoGLINT introduces a Transformer-based autoregressive generator trained on hydrodynamic simulations to populate DM-only haloes with galaxies, producing properties such as SFR, offsets, and velocities conditioned on halo mass $M$. It reproduces key LIM statistics, including voxel SFR distributions and real/redshift-space power spectra, and can produce multiple mock realizations and lightcones by applying the learned distributions to DM-only halo catalogues. The approach is demonstrated on IllustrisTNG data and extended to large-volume DM-only runs (e.g., Pinocchio) with halo-mass rescaling to mimic baryonic effects, enabling realistic, scalable LIM mocks for current and future surveys. Limitations related to subgrid physics and extrapolation to very massive haloes are discussed, with potential extensions to environment, concentration, metallicity, and multi-line emission to enhance realism and cross-survey analyses.

Abstract

Modelling star-forming galaxies is crucial for upcoming observations of large-scale matter and galaxy distributions with galaxy redshift surveys and line intensity mapping (LIM). We introduce CosmoGLINT (Cosmological Generative model for Line INtensity mapping with Transformer), a Transformer-based generative framework designed to create realistic galaxy populations from dark matter (DM)-only simulations. CosmoGLINT auto-regressively generates sequences of galaxy properties -- including star formation rate (SFR), distance to the halo centre, and radial and tangential velocities relative to the halo -- conditioned on halo mass. Trained on the IllustrisTNG hydrodynamic simulation, the model reproduces key statistical properties of the original data, including the voxel intensity distribution and the power spectrum in both real and redshift space. It can efficiently generate a number of different realisations of the designated galaxy populations, enabling the creation of mock LIM/redshift survey catalogues from large halo catalogues produced by fast DM-only simulations. We show that our model, trained at multiple redshifts, can be applied to DM halo lightcone data to generate a realistic mock galaxy lightcone that incorporates the redshift evolution of the galaxy population. The mock catalogues can be readily used to derive statistical quantities and to develop data analysis pipelines for ongoing and future wide-field surveys.

Paper Structure

This paper contains 17 sections, 11 equations, 17 figures.

Figures (17)

  • Figure 1: Schematic picture of the model architecture. The model takes the halo mass $M$ and the sequence of galaxy properties $\bm{\theta}$ as inputs. These inputs are embedded into a high-dimensional latent space using neural networks $f$ and $g$. The resulting sequence is passed through a Transformer decoder, which outputs a sequence of predicted probability density functions $(p(\bm{\theta}_0), \dots, p(\bm{\theta}_{i}))$. Positional encodings (PE) are concatenated to the input sequence to enable the model to learn the order of the galaxies. For the prediction of $p(\bm{\theta}_{i})$, future elements, $\bm{\theta}_{i}$, $\bm{\theta}_{i+1}$, ..., are masked out and not used even if they are present in the input sequence. In the Transformer decoder, the self-attention mechanism allows all the values from $M$ to $\bm{\theta}_{i-1}$ to effectively contribute to the prediction of $p(\bm{\theta}_{i})$ within a few to several layers. When sampling new galaxies, the model generates them one by one, starting from the halo mass; at each step, the predicted probability distribution is used to sample the next galaxy property, as indicated by the dotted arrow.
  • Figure 2: Two examples of the predicted probability distributions of SFR for test haloes with $M \sim 10^{12} ~\rm M_\odot$. The inverted triangles indicate the true values, which are used to predict the probability distributions. The narrow panels on the left show the probabilities of sampling values below $10^{-3}~\rm M_\odot/yr$, which are treated as the end token, on a linear scale spanning from 0 at the bottom to 1 at the top. The numbers in the labels indicate the order within haloes: 0 for the central, 1 for the first satellite, and so on. The top and bottom examples contain two and five galaxies, and the predicted probability densities are shown up to the third and sixth galaxies, respectively.
  • Figure 3: Two-dimensional distributions of central and first satellite SFRs for the TNG (black) and generated (red) catalogues. Haloes within $M \in [10^{12.4}, 10^{12.6}) ~\rm M_\odot$ are considered. The contours represent 0.5, 1, and 2-$\sigma$ levels.
  • Figure 4: SFR maps constructed from TNG galaxies (left) and galaxies generated from haloes in TNG (middle) and TNG-Dark (right). Each map has a pixel resolution of 0.59 Mpc. The projection is taken over the half of the simulation volume ($302.6 \times 302.6 \times 151.3$ Mpc) that contains the test region. The region reserved as test data is indicated by the red box. The images are smoothed with a Gaussian kernel for visual clarity.
  • Figure 5: Power spectra of SFR maps of TNG haloes (light blue), TNG galaxies (black), and the mock galaxies generated from TNG haloes (orange) and TNG-Dark haloes (red) in the test region. The solid and dashed lines indicate the power spectra in real and redshift space, respectively.
  • ...and 12 more figures
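
The sampling procedure described in the captions of Figures 1 and 2 (generate galaxies one by one starting from the halo mass, and stop once an SFR below $10^{-3}~\rm M_\odot/yr$ is drawn) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `toy_next_sfr_probs` is a hypothetical stand-in for the trained Transformer decoder, the discretisation into `N_BINS` bins is an assumption made for illustration, and only SFR is sampled here, whereas the model also generates distances and velocities.

```python
import math
import random

# Assumed discretisation (not from the paper): bin 0 stands for
# SFR < 1e-3 Msun/yr and acts as the end token; bins 1..N_BINS-1
# cover log10(SFR) from -3 to 3 in equal-width intervals.
N_BINS = 13
LOG_SFR_MIN, LOG_SFR_MAX = -3.0, 3.0
END_TOKEN = 0


def toy_next_sfr_probs(log_mass, previous_bins):
    """Hypothetical placeholder for the trained Transformer decoder.

    Returns one probability per bin, given the halo mass and the
    galaxies generated so far in this halo.
    """
    n_prev = len(previous_bins)
    # Toy assumption: stopping becomes more likely as the sequence grows
    # and less likely for more massive haloes.
    end_logit = 1.5 * n_prev - 2.0 * (log_mass - 11.0)
    # Toy assumption: the preferred SFR bin rises with halo mass and
    # falls with the rank of the galaxy inside the halo.
    centre = 1.0 + 3.0 * (log_mass - 10.0) - 1.5 * n_prev
    logits = [end_logit] + [-0.5 * (b - centre) ** 2 for b in range(1, N_BINS)]
    peak = max(logits)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_halo(log_mass, rng, max_galaxies=50):
    """Autoregressively sample SFR bins until the end token is drawn."""
    bins = []
    for _ in range(max_galaxies):
        probs = toy_next_sfr_probs(log_mass, bins)
        token = rng.choices(range(N_BINS), weights=probs, k=1)[0]
        if token == END_TOKEN:
            break  # no further galaxies in this halo
        bins.append(token)  # the sampled value conditions the next step
    return bins


def bin_to_log_sfr(token):
    """Map a non-end bin index to the centre of its log10(SFR) interval."""
    width = (LOG_SFR_MAX - LOG_SFR_MIN) / (N_BINS - 1)
    return LOG_SFR_MIN + (token - 0.5) * width


rng = random.Random(0)
log_sfrs = [bin_to_log_sfr(t) for t in sample_halo(12.0, rng)]
```

Because each call to `sample_halo` draws from the predicted distributions rather than taking their maxima, repeated calls with different seeds give different realisations for the same halo, which is how multiple mock catalogues could be produced from one halo catalogue.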