Initial Conditions from Galaxies: Machine-Learning Subgrid Correction to Standard Reconstruction

Liam Parker, Adrian E. Bayer, Uros Seljak

TL;DR

We address the reconstruction of the primordial density field from late-time biased tracers by coupling standard BAO reconstruction with a learned subgrid CNN correction that operates on small, manageable subvolumes and is tiled across the full survey volume. The method is trained on Quijote halo and galaxy mocks in both real and redshift space. It achieves a higher cross-correlation with the true initial field and substantially tighter BAO constraints than standard reconstruction, including on boxes larger than those used for training. Key contributions include the sliding-window CNN architecture, a Fourier-space loss, transfer to larger volumes without retraining, and demonstrated resilience to halo occupation distribution (HOD) misspecification. The approach yields scalable, high-fidelity reconstruction that can enhance cosmological analyses for DESI-like surveys by capturing nonlinearities and bias without sacrificing large-scale accuracy.

Abstract

We present a hybrid method for reconstructing the primordial density field from late-time halos and galaxies. Our approach involves two steps: (1) apply standard Baryon Acoustic Oscillation (BAO) reconstruction to recover the large-scale features of the primordial density field and (2) train a deep learning model to learn small-scale corrections on partitioned subgrids of the full volume. At inference, this correction is convolved across the full survey volume, which allows the method to scale to large surveys. We train our method on both mock halo catalogs and mock galaxy catalogs, in both configuration and redshift space, from the Quijote $1\,(h^{-1}\,\mathrm{Gpc})^3$ simulation suite. When evaluated on held-out simulations, our combined approach significantly improves the cross-correlation coefficient of the reconstruction with the true initial density field and remains robust to moderate model misspecification. Additionally, we show that models trained on $1\,(h^{-1}\,\mathrm{Gpc})^3$ boxes can be applied to larger boxes, e.g., $(3\,h^{-1}\,\mathrm{Gpc})^3$, without retraining. Finally, we perform a Fisher analysis of our method's recovery of the BAO peak and find that it significantly reduces the error on the acoustic scale relative to standard BAO reconstruction. Ultimately, this method robustly captures nonlinearities and bias without sacrificing large-scale accuracy, and its ability to handle arbitrarily large volumes without escalating computational requirements makes it especially promising for large-volume surveys like DESI.
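
The subvolume-then-tile idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `apply_tiled`, `toy_model`, and the `tile` and `pad` parameters are hypothetical names, the box is assumed periodic, and a simple 3x3x3 box average stands in for the trained CNN. The sketch only shows why a model with a finite receptive field can be applied patch by patch to a box of any size while reproducing the result of a full-volume pass.

```python
import numpy as np

def apply_tiled(field, model, tile=16, pad=4):
    """Apply a local `model` to overlapping subvolumes of a periodic cubic field.

    Each subvolume is `tile + 2*pad` cells wide. Only its central `tile`^3
    cells are kept, so edge effects of the local model are discarded.
    """
    n = field.shape[0]
    assert field.shape == (n, n, n) and n % tile == 0
    padded = np.pad(field, pad, mode="wrap")  # periodic boundaries (assumption)
    out = np.empty_like(field)
    w = tile + 2 * pad
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                sub = padded[i:i + w, j:j + w, k:k + w]
                corrected = model(sub)
                # Keep only the interior, where the model saw full context.
                out[i:i + tile, j:j + tile, k:k + tile] = corrected[
                    pad:pad + tile, pad:pad + tile, pad:pad + tile]
    return out

def toy_model(sub):
    """Hypothetical stand-in for the trained CNN: a 3x3x3 box average."""
    acc = np.zeros_like(sub)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                acc += np.roll(sub, (dx, dy, dz), axis=(0, 1, 2))
    return acc / 27.0

rng = np.random.default_rng(0)
small_box = rng.standard_normal((32, 32, 32))
large_box = rng.standard_normal((48, 48, 48))  # larger volume, same model
tiled_small = apply_tiled(small_box, toy_model)
tiled_large = apply_tiled(large_box, toy_model)
```

Because the stand-in model only looks one cell in each direction and `pad` is larger than that, the tiled output matches a single pass over the whole periodic box. Memory use is set by the subvolume size rather than the box size, which is the property that lets the same model be reused on a larger volume.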

Paper Structure

This paper contains 29 sections, 18 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Reconstruction of the linear dark matter field from late-time halos at $\mathbf{z=0.5}$ in configuration space for two number densities. Standard reconstruction is shown in gray, while our CNN-corrected approach is shown in orange. Solid lines indicate mass-weighted halos and dashed lines uniform number-weighting. The left panels compare the reconstructed power spectra to the true linear power (black), the middle panels show the transfer function $T(k)$, and the right panels present the cross-correlation coefficient $r(k)$. Standard reconstruction and our CNN correction converge on large scales; however, our method consistently improves the recovered amplitude and correlation on intermediate to small scales, where nonlinearities dominate.
  • Figure 2: Redshift-space distortion (RSD) analysis for halo catalogs at $\mathbf{\bar{n} = 5\times 10^{-4}}$. In the top panel, the network takes mass-weighted halos as input, while in the bottom panel it takes number-weighted halos. In each sub-panel, line styles correspond to different $\mu$-bins ranging from mostly transverse modes ($\mu \approx 0.17$) to nearly line-of-sight modes ($\mu \approx 0.83$) relative to the RSD direction. Orange lines show results from our CNN-corrected reconstruction, gray lines depict the standard reconstruction, and the blue line indicates the reconstruction performance on configuration-space fields. Our method's improvement over standard reconstruction persists even at higher $\mu$, where redshift-space effects dominate.
  • Figure 3: Impact of halo mass scatter on reconstruction performance in redshift space at $\mathbf{\bar{n} = 5 \times 10^{-4}}$. The horizontal axis indicates the RMS scatter added to each halo’s mass (from 0.0 up to 1.1), and the vertical axis plots the resulting transfer function ($T(k)$, left) or cross-correlation ($r(k)$, right) at characteristic scales $k = 0.1,\,0.2,\,0.3,\,0.4\,h\,\mathrm{Mpc}^{-1}$. Orange lines (“Noised”) show how the reconstruction degrades when masses are randomly perturbed, whereas blue points (“Number-Weighted”) depict uniform-weighted halos. Notably, Ref. [sullivan2023learning] reports that, using observables such as color and galaxy positions, one can achieve mass uncertainties around $\text{RMS}\!\approx\!0.24$ for future surveys, placing them in the regime where a reliable mass proxy still confers appreciable gains.
  • Figure 4: Reconstruction of the linear dark matter field from galaxies at $\mathbf{z=0.5}$ in configuration space for two number densities. For each primordial density field, we show reconstructions from galaxy fields generated using five different HOD parameter draws within the HOD model of Ref. [hahn2023simbig]. Standard reconstruction is shown in gray, while our CNN-corrected approach is shown in orange. Notably, the increased stochasticity arising from the HOD (such as the assignment of satellites and variability in halo-occupation thresholds) does not significantly degrade performance compared to uniformly weighted halos.
  • Figure 5: Redshift-space distortion analysis for galaxy catalogs with $\mathbf{\bar{n} = 5\times 10^{-4}}$. We show the anisotropic transfer function ($T(k,\mu)$, left) and cross-correlation coefficient ($r(k,\mu)$, right) with respect to the true primordial dark matter field, in bins of wavenumber $k$ and angle $\mu$ relative to the line of sight. As with the halo-based RSD analyses, standard reconstruction is shown in gray, our CNN-corrected approach is shown in orange, and the configuration-space CNN correction is shown in blue for reference. Across all $\mu$-bins, the CNN correction recovers the primordial density more accurately than the baseline.
  • ...and 4 more figures
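
The figure captions above report a transfer function $T(k)$ and a cross-correlation coefficient $r(k)$ between a reconstructed field and the true linear field. The following is a minimal sketch of how such statistics are commonly estimated on a periodic grid, assuming the conventions $T(k) = \sqrt{P_{\rm rec}(k)/P_{\rm true}(k)}$ and $r(k) = P_{\rm rec,true}(k)/\sqrt{P_{\rm rec}(k)\,P_{\rm true}(k)}$. These definitions, the function name `transfer_and_correlation`, and the binning choices are assumptions for illustration and may differ from those used in the paper.

```python
import numpy as np

def transfer_and_correlation(rec, true, box_size, n_bins=8):
    """Estimate T(k) and r(k) between two periodic cubic fields in |k| shells.

    Assumed conventions (not taken from the paper):
      T(k) = sqrt(P_rec / P_true),  r(k) = P_cross / sqrt(P_rec * P_true).
    Power-spectrum normalisation cancels in both ratios, so it is omitted.
    """
    n = rec.shape[0]
    rec_k = np.fft.rfftn(rec)
    true_k = np.fft.rfftn(true)

    kf = 2.0 * np.pi / box_size                  # fundamental wavenumber
    kx = np.fft.fftfreq(n, d=1.0 / n) * kf       # full axes
    kz = np.fft.rfftfreq(n, d=1.0 / n) * kf      # half axis of the real FFT
    kmag = np.sqrt(kx[:, None, None] ** 2 + kx[None, :, None] ** 2
                   + kz[None, None, :] ** 2)

    # Linear shells from the fundamental mode up to the Nyquist wavenumber.
    edges = np.linspace(kf, kf * n / 2, n_bins + 1)
    idx = np.digitize(kmag.ravel(), edges) - 1
    keep = (idx >= 0) & (idx < n_bins)

    def shell_sum(x):
        # Sum a per-mode quantity over each |k| shell. Modes on the half
        # grid are weighted equally, a simplification for this sketch.
        return np.bincount(idx[keep], weights=x.ravel()[keep], minlength=n_bins)

    p_rr = shell_sum(np.abs(rec_k) ** 2)
    p_tt = shell_sum(np.abs(true_k) ** 2)
    p_rt = shell_sum((rec_k * np.conj(true_k)).real)
    k_mean = shell_sum(kmag) / shell_sum(np.ones_like(kmag))

    return k_mean, np.sqrt(p_rr / p_tt), p_rt / np.sqrt(p_rr * p_tt)

rng = np.random.default_rng(1)
true_field = rng.standard_normal((32, 32, 32))
# A field that is exactly twice the truth: amplitude is off, phases are perfect.
k_scaled, t_scaled, r_scaled = transfer_and_correlation(
    2.0 * true_field, true_field, box_size=1000.0)
# A field with independent noise added: phases are partially decorrelated.
noisy_field = true_field + rng.standard_normal(true_field.shape)
k_noisy, t_noisy, r_noisy = transfer_and_correlation(
    noisy_field, true_field, box_size=1000.0)
```

The two toy inputs separate the roles of the statistics: a pure rescaling changes $T(k)$ but leaves $r(k)$ at unity, while uncorrelated noise lowers $r(k)$ below one. This is why the captions treat $r(k)$ as the measure of recovered phase information and $T(k)$ as the measure of recovered amplitude.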