Multi-modal Foundation Model for Cosmological Simulation Data

Bin Xia, Nesar Ramachandra, Azton I. Wells, Salman Habib, John Wise

TL;DR

MOSAIC is an encoder-only Transformer that bridges cosmological simulations and observations. It is trained on $185{,}247$ samples and evaluated on $20{,}583$ test samples from the Last Journey simulation, learning a unified representation across scalar ($z$, $M_{\rm halo}$, $M_*$) and vector (photometry, SFH, SED) modalities. The model uses masked regression with a dynamic masking scheme to enable cross-modal translation and missing-data imputation. It achieves $50\%$ and $63\%$ improvements in redshift and stellar-mass inference, respectively, when combining complementary modalities, and latent-space analyses reveal astrophysically meaningful clustering and correlations. The work lays the groundwork for extending to higher-dimensional data and probabilistic decoding, enabling tighter integration of simulations and observations for future cosmological inference.

Abstract

We present a multi-modal foundation model for astrophysical galaxy data, designed to map between simulation- and observation-based galactic features. Our encoder-only transformer flexibly ingests scalar quantities (e.g., redshifts, galaxy masses) and vectors (e.g., star formation histories, spectra), supporting multi-task training that includes within-modality reconstruction and cross-modality prediction. With a dynamic masking strategy, the model can query arbitrary galaxy properties from partial inputs -- including predicting spectra from redshift and mass, or estimating photometric redshifts from broadband magnitudes -- while also recovering missing segments within a modality. Trained on 185,000 simulated galaxies from a gigaparsec-scale cosmology simulation, the model yields a 50% improvement in redshift estimation when combining LSST and SPHEREx photometry over LSST photometry alone, and a 63% improvement in stellar mass inference when combining late-time SFH with LSST photometry over early-time SFH with LSST photometry. The model demonstrates strong generalization across multi-modal tasks and lays the groundwork for future integration of higher-dimensional and structured data such as images, merger trees, and 3D fields. This approach provides a unified framework for connecting simulations and observations, advancing the development of generalizable astrophysical foundation models.

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Schematic of MOSAIC architecture. Each modality is masked and projected into modality-specific embeddings, concatenated along the sequence dimension, combined with positional embeddings, and processed by a Transformer encoder. The sequence is split into modality-specific segments and fed into dedicated regression heads to produce predictions and compute regression loss.
  • Figure 2: Illustration of multi-modal prediction tasks and input masking configurations. Each row corresponds to a different input combination: (1) LSST magnitudes only, (2) LSST magnitudes + SPHEREx colors, (3) LSST magnitudes + early SFH (0–10 Gyr masked), (4) LSST magnitudes + late SFH (10–13.8 Gyr masked), (5) stellar mass only. White areas indicate provided data, and shaded areas indicate masked portions. For vector modalities, each subplot shows three samples with light solid lines for ground truth and dark dotted lines for predictions. For scalar modalities, each subplot displays scatter points for 1000 samples, with points near the diagonal indicating accurate predictions. Shaded bands indicate the 16–84% range ($\approx 1\sigma$) of the normalized mean absolute error (MA$\hat{\text{E}}$) computed over 1000 samples; unnormalized mean absolute error ($\overline{\text{MAE}}$ for vectors and MAE for scalars) is annotated for reference.
  • Figure 3: UMAP visualization of the last hidden state embeddings for five input configurations corresponding to Fig. 2. Each subplot shows embeddings of five modalities (SFH, SED, SPHEREx, LSST, Scalars) for 10000 galaxies. Convex hulls are drawn around points between the 0.1 and 99.9 percentiles to reduce outlier effects, with a distinct color for each modality. Point colors encode normalized scalar ground-truth values (redshift, halo mass, stellar mass). For the first four configurations, embeddings form well-separated clusters corresponding to each modality, and within each cluster, points with similar physical properties (similar colors) are located close to each other, forming smooth color gradients. This indicates that the latent space captures astrophysical correlations even when input scalars are masked. For the fifth configuration, where only stellar mass is provided, clusters overlap and color continuity is absent, consistent with weaker predictive performance. The axes correspond to the two UMAP embedding dimensions, which are abstract and unitless.
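
The masking and concatenation steps described in the Figure 1 and Figure 2 captions can be illustrated with a small NumPy sketch. This is not the authors' implementation: the modality lengths in `MODALITY_DIMS`, the masking probabilities, the zero fill value, and the choice to score the error only on masked positions are all assumptions made for illustration, and the embedding projections, positional embeddings, Transformer encoder, and regression heads are omitted.

```python
import numpy as np

# Hypothetical per-modality lengths; the source does not state these dimensions.
MODALITY_DIMS = {"scalars": 3, "lsst": 6, "spherex": 20, "sfh": 30, "sed": 40}


def sample_mask(dims, rng, p_drop_modality=0.5, p_drop_element=0.3):
    """Draw a random mask per modality (True = hidden from the model).

    With probability p_drop_modality the whole modality is hidden
    (cross-modal prediction); otherwise individual elements are hidden
    (within-modality imputation). Both probabilities are assumptions.
    """
    masks = {}
    for name, d in dims.items():
        if rng.random() < p_drop_modality:
            masks[name] = np.ones(d, dtype=bool)
        else:
            masks[name] = rng.random(d) < p_drop_element
    return masks


def build_sequence(sample, masks, mask_value=0.0):
    """Mask each modality, then concatenate along the sequence dimension."""
    parts, flags = [], []
    for name, values in sample.items():
        parts.append(np.where(masks[name], mask_value, values))
        flags.append(masks[name])
    return np.concatenate(parts), np.concatenate(flags)


def masked_mae(pred, target, mask):
    """Mean absolute error over masked positions only (an assumed loss choice)."""
    if not mask.any():
        return 0.0
    return float(np.abs(pred - target)[mask].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    galaxy = {k: rng.normal(size=d) for k, d in MODALITY_DIMS.items()}

    # Fixed evaluation configuration like row (1) of Figure 2: LSST only.
    lsst_only = {k: np.ones(d, dtype=bool) for k, d in MODALITY_DIMS.items()}
    lsst_only["lsst"][:] = False
    seq, flags = build_sequence(galaxy, lsst_only)
    print(seq.shape, int(flags.sum()))

    # Random training-time mask.
    train_mask = sample_mask(MODALITY_DIMS, rng)
    print({k: int(m.sum()) for k, m in train_mask.items()})
```

Under these assumptions, the five evaluation rows of Figure 2 are just fixed masks of the kind built in the `lsst_only` example, while training would draw a fresh mask per sample with something like `sample_mask`.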