Comparison of symbolic regression algorithms in Star/galaxy/quasar separation

Rachit Deshpande, Shantanu Desai

TL;DR

This work conducts a systematic comparative study of four state-of-the-art SR frameworks and shows that these models not only match the performance of traditional baselines but also provide a transparent, mathematically concise characterization of the astrophysical boundaries separating galactic and extragalactic populations.

Abstract

This work investigates symbolic regression (SR) as an interpretable alternative to black-box machine learning for the classification of stars, galaxies, and quasars in the Sloan Digital Sky Survey Data Release 17 (SDSS DR17). We conduct a systematic comparative study of four state-of-the-art SR frameworks: {\tt PySR}, Exhaustive Symbolic Regression ({\tt ESR}) with MDL-based selection, Physical Symbolic Optimization ({\tt PhySO}) using deep reinforcement learning, and Multi-View Symbolic Regression ({\tt MvSR}). By deriving compact analytic functions (complexity $\leq 10$) on a representative training subset and subsequently evaluating them via a 5-fold stratified cross-validation protocol on 100,000 spectroscopically confirmed objects, we map spectroscopic redshift ($z$) to continuous classification scores. Our results demonstrate that these low-complexity expressions achieve high predictive reliability, with {\tt MvSR} reaching a Cohen's Kappa of 0.8948 and {\tt PhySO} achieving exceptional parametric stability ($\sigma < 0.002$). We show that these models not only match the performance of traditional baselines but also provide a transparent, mathematically concise characterization of the astrophysical boundaries separating galactic and extragalactic populations.
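
The evaluation protocol described in the abstract (a fixed analytic score of redshift, stratified 5-fold cross-validation, and Cohen's Kappa as the agreement metric) can be illustrated with a minimal, standard-library-only sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the score function `score(z)` is a hypothetical placeholder rather than any expression found by PySR, ESR, PhySO, or MvSR; the two-threshold decision rule, the threshold grid, and the synthetic redshift ranges are likewise invented; and the real study uses SDSS DR17 data, not toy samples.

```python
import math
import random
from collections import Counter, defaultdict


def score(z):
    # Hypothetical low-complexity score; NOT an expression from the paper.
    return 1.0 - math.exp(-3.0 * abs(z))


def classify(z, t_low, t_high):
    # Assumed decision rule: two thresholds on the continuous score.
    s = score(z)
    if s < t_low:
        return "STAR"
    if s < t_high:
        return "GALAXY"
    return "QSO"


def cohen_kappa(y_true, y_pred):
    # kappa = (p_o - p_e) / (1 - p_e), with p_e from the label marginals.
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    classes = set(y_true) | set(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in classes) / n ** 2
    if p_e == 1.0:  # degenerate case: a single class on both sides
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def stratified_folds(labels, k, seed=0):
    # Deal the shuffled indices of each class round-robin into k folds,
    # so every fold keeps approximately the overall class proportions.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds


def fit_thresholds(zs, labels, grid):
    # Tiny grid search standing in for per-fold parameter fitting.
    best = None
    for t_low in grid:
        for t_high in grid:
            if t_low >= t_high:
                continue
            preds = [classify(z, t_low, t_high) for z in zs]
            kappa = cohen_kappa(labels, preds)
            if best is None or kappa > best[0]:
                best = (kappa, t_low, t_high)
    return best[1], best[2]


def cross_validate(zs, labels, k=5, seed=0):
    # Fit thresholds on k-1 folds, score Cohen's Kappa on the held-out fold.
    grid = [0.05, 0.1, 0.5, 0.85, 0.9]  # illustrative grid, an assumption
    kappas = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(zs)) if i not in held]
        t_low, t_high = fit_thresholds(
            [zs[i] for i in train], [labels[i] for i in train], grid
        )
        preds = [classify(zs[i], t_low, t_high) for i in held_out]
        kappas.append(cohen_kappa([labels[i] for i in held_out], preds))
    return kappas


def make_toy_data(n_per_class=50, seed=0):
    # Synthetic, well-separated redshifts; purely illustrative values.
    rng = random.Random(seed)
    zs, labels = [], []
    for _ in range(n_per_class):
        zs.append(rng.uniform(-0.001, 0.001)); labels.append("STAR")
        zs.append(rng.uniform(0.05, 0.5)); labels.append("GALAXY")
        zs.append(rng.uniform(0.8, 3.0)); labels.append("QSO")
    return zs, labels


if __name__ == "__main__":
    zs, labels = make_toy_data()
    print(cross_validate(zs, labels))
```

Because the toy classes are separated by construction, the per-fold scores here say nothing about real survey performance; the sketch only shows how a single analytic score can be turned into class labels and assessed fold by fold. The spread of fitted parameters across folds would be the analogue of the parametric stability quoted above.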

Paper Structure (44 sections, 41 equations, 24 figures, 5 tables)

Figures (24)

  • Figure 1: Class distribution of the SDSS DR17 dataset across GALAXY, STAR, and QSO categories.
  • Figure 2: Normalized redshift distributions of stars, galaxies, and quasars shown on a logarithmic scale.
  • Figure 3: $u - g$ color as a function of redshift for stars, galaxies, and quasars, highlighting population-specific trends.
  • Figure 4: Unified experimental pipeline for SDSS DR17 classification.
  • Figure 5: Pareto-optimal frontier for the PySR discovery phase. The plot illustrates the trade-off between functional complexity and $L_2$ loss. A significant performance gain is observed at $C=6$, corresponding to the introduction of the exponential ($\exp$) operator, which captures the non-linear population transitions. The final scoring function $s(z)$ (indicated by the star) was selected at the complexity limit of $C=10$ to maximize representational accuracy while maintaining interpretability, following the scaling law selection criteria discussed in Darc.
  • ...and 19 more figures