The PAU Survey: Uncovering the connection between intrinsic and observed galaxy properties using symbolic regression

Adarsh Kumar, Carlton M. Baugh, Suttikoon Koonkor, Giorgio Manzoni, Sukanta Panda, D. Navarro Girones, R. Casas, J. Carretero, F. Castander, J. De Vicente, J. Garcia Bellido, E. Gaztanaga, R. Miquel, P. Renard, P. Tallada Crespi

TL;DR

This work tackles the challenge of rapidly and accurately estimating galaxy stellar masses from photometry and redshifts in the era of massive surveys. It first benchmarks a deep neural network and then derives explicit, interpretable mass–observable relations via symbolic regression using a GALFORM PAUS mock, restricting to linear combinations of four observables to maximize interpretability and speed. The resulting expressions reproduce masses with accuracy comparable to SED fitting for the bulk of the population, while remaining robust to observational noise and offering instantaneous evaluation for millions of galaxies; comparison with PAUS/CIGALE masses shows agreement within 0.13 dex for $M_* > 10^8\,M_\odot$. The method enables fast construction of the stellar mass function and offers a transparent alternative to traditional SED-based approaches, albeit with systematic biases at the mass extremes and limited transferability to other surveys without re-training. Overall, the study demonstrates that simple, physically interpretable formulas can approximate complex SED methods and substantially accelerate the analysis of large-scale galaxy surveys.

Abstract

Estimating stellar masses for billions of galaxies in upcoming surveys requires methods that are both accurate and computationally efficient. We present a new approach using symbolic regression trained on a simulation to derive simple, explicit mathematical expressions that estimate galaxy stellar masses from basic observables: photometry and redshift. Using a mock catalogue from the GALFORM semi-analytical model that reproduces the Physics of the Accelerating Universe Survey (PAUS), we show that a linear combination of just four observables -- minimally processed $u$- and $i$-band magnitudes, observed $(g-r)$ colour, and redshift -- can recover stellar masses with accuracy comparable to traditional spectral energy distribution (SED) fitting, but with negligible computational cost. Our expressions can be evaluated instantaneously for millions of galaxies, making them ideal for next-generation surveys like LSST and Euclid. When observational errors are included, symbolic regression achieves a similar accuracy to deep neural networks while maintaining transparency. Validation against CIGALE SED fitting on PAUS data shows agreement within 0.13 dex for galaxies with $M_{*} > 10^8 M_{\odot}$. We demonstrate that the stellar mass function can be recovered at $z < 0.5$, though with distortions at the extremes: the high-mass end is overestimated by a factor of $\sim 3$ at $10^{11.5} h^{-1} M_{\odot}$ due to scatter. Our approach offers a fast, transparent alternative to traditional methods without sacrificing accuracy for the bulk of the galaxy population.
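
To make the idea of "a linear combination of just four observables" concrete, the following is a minimal sketch, not the paper's pipeline. It assumes ordinary least squares as a stand-in for the symbolic regression search, and it runs on synthetic data. The function names, the input ranges and the coefficients in `true_coeffs` are all invented for illustration; none of them are taken from the paper.

```python
import numpy as np


def design_matrix(u_mag, i_mag, g_minus_r, z):
    """Stack a constant term and the four observables column by column."""
    return np.column_stack([np.ones_like(u_mag), u_mag, i_mag, g_minus_r, z])


def fit_linear_mass_relation(u_mag, i_mag, g_minus_r, z, log_mass):
    """Least-squares fit of log10(M*) = c0 + c1*u + c2*i + c3*(g-r) + c4*z.

    Assumption: plain least squares replaces the symbolic regression search.
    """
    X = design_matrix(u_mag, i_mag, g_minus_r, z)
    coeffs, *_ = np.linalg.lstsq(X, log_mass, rcond=None)
    return coeffs


def predict_log_mass(coeffs, u_mag, i_mag, g_minus_r, z):
    """Evaluate the fitted expression; a single matrix-vector product."""
    return design_matrix(u_mag, i_mag, g_minus_r, z) @ coeffs


# Synthetic stand-in for a mock catalogue (all values are made up).
rng = np.random.default_rng(0)
n = 5000
u_mag = rng.uniform(-20.0, -14.0, n)      # hypothetical magnitude-like input
i_mag = rng.uniform(-23.0, -16.0, n)      # hypothetical magnitude-like input
g_minus_r = rng.uniform(0.2, 1.6, n)      # observed colour
z = rng.uniform(0.0, 1.0, n)              # redshift

# Hypothetical coefficients used only to generate the toy "true" masses.
true_coeffs = np.array([1.0, -0.05, -0.40, 0.60, 0.30])
log_mass = predict_log_mass(true_coeffs, u_mag, i_mag, g_minus_r, z)
log_mass = log_mass + rng.normal(0.0, 0.1, n)  # 0.1 dex of intrinsic scatter

coeffs = fit_linear_mass_relation(u_mag, i_mag, g_minus_r, z, log_mass)
residual = predict_log_mass(coeffs, u_mag, i_mag, g_minus_r, z) - log_mass
scatter = float(np.std(residual))
```

Once the coefficients are fixed, evaluating the expression for a catalogue costs one pass over the input arrays, which is why such a relation can be applied to millions of galaxies almost instantly.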

Paper Structure

This paper contains 22 sections, 7 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: The distribution of observed $g-r$ colour for galaxies with $i_{\rm AB} < 22.5$ in selected redshift slices: top $0.20<z<0.25$, middle $0.35<z<0.40$, bottom $0.85<z<0.90$. The solid lines show the PAUS observations, the dotted lines show the GALFORM mock without photometry errors and the dashed lines show the mock with photometry errors included. The curves have been normalised to enclose the same area. The line colours have no meaning, and have been chosen for artistic merit.
  • Figure 2: The distribution of stellar masses for galaxies in the PAUS mock with $i_{\rm AB} < 22.5$. The blue curve shows all galaxies, the orange curve galaxies with $z<0.55$ and the green curve $z>0.55$. The $y$-axis gives the raw number of galaxies in each bin, without adjustment for the bin width.
  • Figure 3: The median stellar mass as a function of the minimal absolute magnitude (see text), plotted in narrow redshift slices: top - $0.12<z<0.13$, middle - $0.52<z<0.53$, bottom - $1.72<z<1.82$. The larger symbols show the median stellar mass and the bars show the $25^{\rm th}$ to $75^{\rm th}$ percentile range; when there is a small number of galaxies in a bin, the data points are not shown. Points are offset along the $x$-axis for clarity. The red points and lines are for the observed $i$-band and the purple points and lines for the observed $u$-band. In the middle panel we also show the individual galaxies, with purple dots showing the $u$-band and red dots the $i$-band, to help with the interpretation of the shapes of the median curves.
  • Figure 4: The difference in the logarithm of the estimated stellar mass and that of the true stellar mass, plotted as a function of the log of the true stellar mass. Here we are comparing the performance of the ANN using 6 input parameters against using 4 input parameters, for both red and blue galaxies. There are 177416 red galaxies and 297996 blue galaxies. The bars show the 25-75 percentile spread of the predicted masses. There is a modest increase in the scatter on the estimated stellar masses when restricting attention to four inputs. There is a tendency for the most massive blue galaxies to have their stellar mass underestimated.
  • Figure 5: The performance of the ANN for low and high redshift samples, when trained for (i) a single sample covering the whole redshift range and (ii) for low and high redshift samples separately. The top panel shows the results for red galaxies and the bottom panel those for blue galaxies. The metric value is labelled for each case and the bias in the estimated stellar mass is plotted as a function of the true stellar mass. The points show the median residual and the bars the 25-75 percentile range. The key shows the colours used to denote each sample; these colours are also used to report the metric values. In the top panel, the red and purple lines show results for the ANN trained using all redshifts, when applied to all galaxies (red) and just to the high redshift sample (purple). Red should be compared with gold, which shows the low redshift galaxy predictions for an ANN trained using all galaxies. In the bottom panel the equivalent comparison is between blue and black points.
  • ...and 8 more figures
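
Several of the captions above describe the same summary statistic: the median residual between estimated and true log stellar mass in bins of true mass, with bars spanning the 25th to 75th percentiles and sparsely populated bins omitted. The sketch below shows one way that statistic could be computed. The function name `binned_residual_stats`, the `min_count` threshold and the synthetic inputs are assumptions for illustration, not values from the paper.

```python
import numpy as np


def binned_residual_stats(log_true, log_est, bin_edges, min_count=10):
    """Median and 25-75 percentile range of (log_est - log_true) per bin.

    Bins of true log mass holding fewer than `min_count` galaxies are
    skipped (the threshold is a hypothetical choice).
    Returns a list of (bin_centre, median, q25, q75, count) tuples.
    """
    residual = log_est - log_true
    # np.digitize returns 1-based bin indices for edges in increasing order.
    idx = np.digitize(log_true, bin_edges) - 1
    rows = []
    for b in range(len(bin_edges) - 1):
        r = residual[idx == b]
        if r.size < min_count:
            continue
        q25, q50, q75 = np.percentile(r, [25, 50, 75])
        centre = 0.5 * (bin_edges[b] + bin_edges[b + 1])
        rows.append((centre, q50, q25, q75, r.size))
    return rows


# Synthetic example: unbiased estimates with 0.15 dex Gaussian scatter.
rng = np.random.default_rng(1)
log_true = rng.uniform(8.0, 12.0, 20000)
log_est = log_true + rng.normal(0.0, 0.15, log_true.size)
bin_edges = np.arange(8.0, 12.5, 0.5)  # eight bins, each 0.5 dex wide

stats = binned_residual_stats(log_true, log_est, bin_edges)
```

A median that stays near zero across bins indicates an unbiased estimator, while a drift at the lowest or highest masses would correspond to the systematic offsets at the mass extremes mentioned in the abstract.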