Comparison of symbolic regression algorithms in Star/galaxy/quasar separation

Rachit Deshpande; Shantanu Desai

TL;DR

This work systematically compares four state-of-the-art symbolic regression (SR) frameworks for star/galaxy/quasar classification and shows that the resulting models not only match the performance of traditional baselines but also provide a transparent, mathematically concise characterization of the astrophysical boundaries separating galactic and extragalactic populations.

Abstract

This work investigates symbolic regression (SR) as an interpretable alternative to black-box machine learning for the classification of stars, galaxies, and quasars in the Sloan Digital Sky Survey Data Release 17 (SDSS DR17). We conduct a systematic comparative study of four state-of-the-art SR frameworks: {\tt PySR}, Exhaustive Symbolic Regression ({\tt ESR}) with MDL-based selection, Physical Symbolic Optimization ({\tt PhySO}) using deep reinforcement learning, and Multi-View Symbolic Regression ({\tt MvSR}). We derive compact analytic functions (complexity $\leq 10$) that map spectroscopic redshift ($z$) to continuous classification scores on a representative training subset, and then evaluate them with a 5-fold stratified cross-validation protocol on 100,000 spectroscopically confirmed objects. Our results demonstrate that these low-complexity expressions achieve high predictive reliability, with {\tt MvSR} reaching a Cohen's Kappa of 0.8948 and {\tt PhySO} achieving exceptional parametric stability ($\sigma < 0.002$). We show that these models not only match the performance of traditional baselines but also provide a transparent, mathematically concise characterization of the astrophysical boundaries separating galactic and extragalactic populations.
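The evaluation idea in the abstract (a score function of redshift, thresholded into three classes, scored with Cohen's Kappa over stratified folds) can be sketched roughly as follows. This is only an illustration under stated assumptions: the score function `score`, the threshold values, and the helper names `classify`, `cohen_kappa`, and `stratified_folds` are hypothetical and are not the expressions or code used in the paper.

```python
import math
from bisect import bisect_right
from collections import Counter

LABELS = ("STAR", "GALAXY", "QSO")


def score(z):
    # Hypothetical monotone score s(z); NOT an expression from the paper.
    return 1.0 - math.exp(-z)


def classify(z, thresholds=(0.002, 0.6)):
    # Hypothetical cut points on s(z): below the first is STAR,
    # between the two is GALAXY, above the second is QSO.
    return LABELS[bisect_right(thresholds, score(z))]


def cohen_kappa(y_true, y_pred):
    # kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement
    # and p_e the agreement expected by chance from the class marginals.
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: a single class on both sides
    return (p_o - p_e) / (1.0 - p_e)


def stratified_folds(labels, k=5):
    # Deal the indices of each class round-robin into k folds so that every
    # fold keeps roughly the overall class proportions (no shuffling here).
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds
```

For each fold, one would compute `cohen_kappa` between the true labels and `classify(z)` on the held-out indices and then report the mean and spread across folds. A production pipeline would typically shuffle within each class before assigning folds.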

Paper Structure (44 sections, 41 equations, 24 figures, 5 tables)

Introduction
Dataset
Methodology
General Approach
Experimental Protocol and Data Preprocessing
Symbolic Regression Algorithms
PySR: Evolutionary Island-Model Search
Model Selection and Thresholding
Exhaustive Symbolic Regression (ESR) with MDL Selection
Deep Reinforcement Learning Symbolic Regression (PhySO)
Multi-View Symbolic Regression (MvSR)
Machine Learning Benchmarks
Random Forest (RF)
Support Vector Machine (SVM)
Multi-Layer Perceptron (MLP)
...and 29 more sections

Figures (24)

Figure 1: Class distribution of the SDSS DR17 dataset across GALAXY, STAR, and QSO categories.
Figure 2: Normalized redshift distributions of stars, galaxies, and quasars shown on a logarithmic scale.
Figure 3: $u - g$ color as a function of redshift for stars, galaxies, and quasars, highlighting population-specific trends.
Figure 4: Unified experimental pipeline for SDSS DR17 classification.
Figure 5: Pareto-optimal frontier for the PySR discovery phase. The plot illustrates the trade-off between functional complexity and $L_2$ loss. A significant performance gain is observed at $C=6$, corresponding to the introduction of the exponential ($\exp$) operator, which captures the non-linear population transitions. The final scoring function $s(z)$ (indicated by the star) was selected at the complexity limit of $C=10$ to maximize representational accuracy while maintaining interpretability, following the scaling law selection criteria discussed in Darc.
...and 19 more figures
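
The complexity/loss trade-off described in the Figure 5 caption can be illustrated with a small sketch that extracts a Pareto front from candidate expressions and picks the entry at a complexity limit. The function names `pareto_front` and `select_at_limit` and the candidate values in the usage note are hypothetical; this is not PySR's internal selection logic and the numbers are not results from the paper.

```python
def pareto_front(candidates):
    # candidates: iterable of (complexity, loss) pairs.
    # Keep a candidate only if its loss is strictly lower than that of
    # every candidate with equal or smaller complexity.
    front = []
    best_loss = float("inf")
    for complexity, loss in sorted(candidates):
        if loss < best_loss:
            front.append((complexity, loss))
            best_loss = loss
    return front


def select_at_limit(front, max_complexity=10):
    # Choose the most complex front member that still respects the limit.
    allowed = [c for c in front if c[0] <= max_complexity]
    if not allowed:
        raise ValueError("no candidate within the complexity limit")
    return max(allowed, key=lambda c: c[0])
```

As a usage example with made-up values, `pareto_front([(1, 0.9), (3, 0.5), (4, 0.6), (6, 0.1), (10, 0.05), (12, 0.04)])` drops the dominated `(4, 0.6)` entry, and `select_at_limit` on the result with `max_complexity=10` returns `(10, 0.05)`.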
