
A Probabilistic Autoencoder for Galaxy SED Reconstruction and Redshift Estimation: Application to Mock SPHEREx Spectrophotometry

Richard M. Feder, Liam Parker, Uroš Seljak

Abstract

We present a probabilistic autoencoder (PAE) framework for galaxy spectral energy distribution (SED) modeling and redshift estimation, applied to synthetic SPHEREx 102-band spectrophotometry. Our PAE learns a compact latent representation of rest-frame galaxy SEDs transformed to a simple Gaussian base density using a normalizing flow, combined with an explicit forward model enabling joint Bayesian inference over intrinsic SED parameters and redshift with well-defined priors. In controlled tests on simulated SPHEREx spectra, our PAE improves on template fitting (TF) in source recovery, outlier rate, and posterior calibration, with trade-offs in redshift performance that depend on the assumed priors. A simple cut on the ratio of PAE and TF uncertainties identifies sources that are overwhelmingly TF outliers, which can be used to clean existing TF samples while retaining the vast majority of well-recovered sources. By directly profiling over PAE latent variables, we show these cases correspond to shallow likelihood surfaces where the PAE's continuous SED manifold produces broader likelihoods that more faithfully reflect the lack of constraining power in the data, whereas the TF discrete model grid yields artificially confident but incorrect redshift estimates. Lastly, we present an alternative, simulation-based inference approach using a Transformer encoder and conditional normalizing flow, which provides similar redshift performance to the PAE but with $\sim200\times$ faster inference throughput. Our implementation, \texttt{PAESpec}, is publicly available and provides a foundation for principled redshift estimation in modern photometric surveys.


Paper Structure

This paper contains 29 sections, 14 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Schematic describing the training (left) and inference (right) stages of our probabilistic autoencoder model. The variables $\pmb{\zeta}$, $\mathbf{u}$ and $z$ denote the autoencoder latent vector, normalizing flow basis vector, and redshift, respectively. The blue and red arrows specify separate forward passes used when training the autoencoder and the normalizing flow. We note that additional priors beyond the NF prior (e.g., on redshift) can be folded into the PAE inference.
  • Figure 2: Number counts in $dz=0.1$ bins from our mock COSMOS sample (blue), alongside our best-fit parametric model for $p(z)$. Because the COSMOS-2020 footprint is relatively small, sample variance uncertainties on the measured $N(z)$ are significant, so our fitted prior is only approximate.
  • Figure 3: Rest-frame SED reconstruction performance of our trained autoencoder. The left panel shows the distribution of per-galaxy mean-squared errors (MSE) from our validation set, while the right panel shows the average MSE as a function of wavelength. The vertical lines correspond to the positions of known emission lines, while the shaded band covers both the 3.3 $\mu$m feature and the 3.4 $\mu$m aliphatic shoulder from polycyclic aromatic hydrocarbons (PAHs).
  • Figure 4: An example comparison of the autoencoder latent variable distribution $p(\pmb{\zeta})$ (left) and the normalizing flow latent distribution $p(\mathbf{u})$ (right), for the case $n_{\rm latent}=5$. The normalizing flow transforms $p(\pmb{\zeta})$, which is highly non-Gaussian and multimodal, into a simpler base distribution, though some residual structure remains.
  • Figure 5: Normalizing flow latent vector norms for 10,000 galaxies in our validation set as a function of redshift. The mean trend (black) shows a mild dependence of $\|\mathbf{u}\|_2$ on redshift between $z=0$ and $z=3$, with Pearson correlation coefficient $r=0.132$.
  • ...and 12 more figures
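
The diagnostic described in the Figure 5 caption (latent vector norms against redshift, a binned mean trend, and a Pearson correlation coefficient) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code or data: the latent vectors are drawn from a standard normal, and the mild redshift-dependent scaling (`0.03 * z`) and the choice of 12 redshift bins are assumptions made purely so that the example has something to measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes follow the captions (10,000 galaxies, n_latent = 5); the data are synthetic.
n_gal, n_latent = 10_000, 5
z = rng.uniform(0.0, 3.0, size=n_gal)

# Stand-in for flow latents u ~ N(0, I). The 0.03 * z scaling is a hypothetical
# drift injected only to mimic a mild norm-redshift dependence.
u = rng.standard_normal((n_gal, n_latent))
u *= (1.0 + 0.03 * z)[:, None]

# L2 norm of each latent vector.
u_norm = np.linalg.norm(u, axis=1)

# Pearson correlation coefficient between redshift and latent norm.
r = np.corrcoef(z, u_norm)[0, 1]

# Mean trend in equal-width redshift bins (12 bins is an arbitrary choice).
edges = np.linspace(0.0, 3.0, 13)
bin_idx = np.digitize(z, edges[1:-1])  # integer bin labels 0..11
mean_trend = np.array([u_norm[bin_idx == k].mean() for k in range(len(edges) - 1)])

print(f"Pearson r = {r:.3f}")
print("binned mean of ||u||_2:", np.round(mean_trend, 3))
```

For a perfectly Gaussianized latent space that is independent of redshift, `r` would be consistent with zero, so a small positive value signals residual redshift information in the norm.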