Class Symbolic Regression: Gotta Fit 'Em All

Wassim Tenachi, Rodrigo Ibata, Thibaut L. François, Foivos I. Diakogiannis

TL;DR

Class Symbolic Regression (Class SR) addresses the problem of discovering a single analytic form that simultaneously fits multiple related datasets, with some parameters shared across the whole class and others specific to each dataset. Built on the $\Phi$-SO framework, it combines dimensional analysis constraints with deep reinforcement learning to search for universal governing laws. Free parameters are optimized with an L-BFGS-based fit over realizations, and the search is guided by a reward derived from the normalized RMSE. The authors demonstrate the approach on a first Class SR benchmark of eight physics-inspired problems and on an astrophysical application that recovers a Milky Way–like NFW potential from stellar stream data, showing higher exact symbolic recovery and greater robustness to noise than traditional single-dataset SR. Key contributions include: (i) introducing Class SR as a hierarchical extension of $\Phi$-SO for multi-dataset symbolic regression; (ii) defining a concrete optimization-and-RL loop that jointly tunes class and realization-specific parameters; (iii) creating a first Class SR benchmark and demonstrating improved performance, especially under measurement noise; (iv) validating the method with an astrophysical example that yields a concise analytic potential from stellar streams.
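The split between class-wide and realization-specific parameters can be illustrated with a deliberately simple toy, sketched here under several assumptions. The functional form `y = c * x + b_i` is hypothetical (`c` shared by the class, `b_i` specific to each dataset). Because that form is linear in its parameters, the sketch swaps the L-BFGS-based fit for closed-form least squares. The mapping `reward = 1 / (1 + NRMSE)` and the normalization by the pooled standard deviation of the targets are assumed choices, not details stated in this summary.

```python
import math

def fit_shared_slope(datasets):
    """Least-squares fit of y = c * x + b_i: one shared c, one b_i per dataset."""
    x_means = [sum(x) / len(x) for x, _ in datasets]
    y_means = [sum(y) / len(y) for _, y in datasets]
    num = den = 0.0
    for (x, y), xm, ym in zip(datasets, x_means, y_means):
        for xj, yj in zip(x, y):
            num += (xj - xm) * (yj - ym)
            den += (xj - xm) ** 2
    c = num / den                                  # class-wide parameter
    offsets = [ym - c * xm for xm, ym in zip(x_means, y_means)]  # per dataset
    return c, offsets

def nrmse(datasets, c, offsets):
    """RMSE over all realizations, divided by the pooled std of the targets."""
    residuals, targets = [], []
    for (x, y), b in zip(datasets, offsets):
        for xj, yj in zip(x, y):
            residuals.append(yj - (c * xj + b))
            targets.append(yj)
    mean_y = sum(targets) / len(targets)
    sigma = math.sqrt(sum((t - mean_y) ** 2 for t in targets) / len(targets))
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return rmse / sigma

def reward(nrmse_value):
    """Assumed squashing of NRMSE into (0, 1]; a perfect fit gives 1."""
    return 1.0 / (1.0 + nrmse_value)

# Three noiseless realizations of y = 2 * x + b_i with different offsets b_i.
xs = [0.0, 1.0, 2.0, 3.0]
true_offsets = [-1.0, 0.5, 4.0]
datasets = [(xs, [2.0 * x + b for x in xs]) for b in true_offsets]

c, offsets = fit_shared_slope(datasets)
r = reward(nrmse(datasets, c, offsets))
print(c, offsets, r)
```

In this toy every realization constrains the shared slope, while each offset is informed only by its own dataset, which is the hierarchy the method relies on.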

Abstract

We introduce 'Class Symbolic Regression' (Class SR), a first framework for automatically finding a single analytical functional form that accurately fits multiple datasets, each realization being governed by its own (possibly) unique set of fitting parameters. This hierarchical framework leverages the common constraint that all the members of a single class of physical phenomena follow a common governing law. Our approach extends the capabilities of our earlier Physical Symbolic Optimization ($\Phi$-SO) framework for Symbolic Regression, which integrates dimensional analysis constraints and deep reinforcement learning for unsupervised symbolic analytical function discovery from data. Additionally, we introduce the first Class SR benchmark, comprising a series of synthetic physical challenges specifically designed to evaluate such algorithms. We demonstrate the efficacy of our novel approach by applying it to these benchmark challenges and showcase its practical utility for astrophysics by successfully extracting an analytic galaxy potential from a set of simulated orbits approximating stellar streams.
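To make the idea of a dimensional analysis constraint concrete, the following minimal sketch tracks physical units as exponent vectors and rejects sums of mismatched quantities. It is a generic illustration, not the framework's actual machinery: the `(length, mass, time)` convention and all function names are assumptions introduced here.

```python
# Units as exponent vectors over (length, mass, time); an illustrative convention.
LENGTH = (1, 0, 0)
MASS = (0, 1, 0)
TIME = (0, 0, 1)

def multiply(u, v):
    """Units of a product: exponents add."""
    return tuple(a + b for a, b in zip(u, v))

def divide(u, v):
    """Units of a quotient: exponents subtract."""
    return tuple(a - b for a, b in zip(u, v))

def power(u, n):
    """Units of a quantity raised to the n-th power."""
    return tuple(a * n for a in u)

def add(u, v):
    """Units of a sum: both terms must carry identical units."""
    if u != v:
        raise ValueError(f"cannot add quantities with units {u} and {v}")
    return u

velocity = divide(LENGTH, TIME)                 # (1, 0, -1)
specific_kinetic_energy = power(velocity, 2)    # v^2 / 2 -> (2, 0, -2)
grav_constant = (3, -1, -2)                     # units of Newton's G
potential = divide(multiply(grav_constant, MASS), LENGTH)  # G * M / r

# Kinetic energy per unit mass and a gravitational potential can be summed...
total = add(specific_kinetic_energy, potential)

# ...whereas adding an energy to a length is dimensionally invalid.
try:
    add(specific_kinetic_energy, LENGTH)
    rejected = False
except ValueError:
    rejected = True
```

A check of this kind prunes candidate expressions that cannot be physically meaningful before any fitting is attempted.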

Paper Structure

This paper contains 5 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Class Symbolic Regression framework sketch: searching for a unique functional form that simultaneously fits multiple datasets. The process starts on the left-hand side, where a batch of trial class analytical expressions is generated using our $\Phi$-SO framework [physo_paper]. The free parameters appearing in those expressions are then optimized in a dataset-specific manner, i.e. each dataset is allowed its own values for each parameter. The neural network used to generate the trial expressions is then reinforced based on the fit quality of the trial symbolic functions. This process is repeated until convergence.
  • Figure 2: Comparison of exact symbolic recovery rates and rates of accurate expressions (having $R^2 > 0.999$) between Class SR and standard SR on our Class SR challenges, using an SRBench-style benchmarking pipeline [SRBench]. Class SR identifies the common underlying function across multiple datasets with varying scale parameters more often than the traditional SR method, which exploits only one dataset at a time, and the gap is largest in the presence of noise.
  • Figure 3: Synthetic stellar stream data utilized by our algorithm to recover the galactic potential. The left and middle panels display the spatial positions of stream members relative to the Milky Way, while the right panel illustrates the kinetic energy of these members as a function of their radial distance from the galactic center.
  • Figure 4: Exact symbolic recovery rate and median $R^2$ achieved by our Class SR algorithm when recovering an NFW dark matter halo model [1997ApJ...490..493N] from synthetic datasets of stellar stream positions and velocities. The metrics are shown as functions of the noise level and the number of realizations exploited. The edge case in which a single realization is used corresponds to the conditions of traditional SR. Class SR substantially outperforms traditional SR, particularly in noisy conditions.
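The quantities in these captions can be tied together in a short sketch. It evaluates the standard NFW potential, written here as $\Phi(r) = -(A/r)\ln(1 + r/r_s)$ with $A = 4\pi G \rho_0 r_s^3$, and scores a candidate against the $R^2 > 0.999$ accuracy criterion of Figure 2. Several points are assumptions made for illustration: that each stream's kinetic energy per unit mass follows $E_i - \Phi(r)$ with a stream-specific energy $E_i$, that the halo parameters are the shared ones, and the dimensionless parameter values themselves.

```python
import math

def nfw_potential(r, amplitude, r_s):
    """NFW potential -(A / r) * ln(1 + r / r_s), with A = 4*pi*G*rho_0*r_s**3."""
    return -(amplitude / r) * math.log(1.0 + r / r_s)

def kinetic_energy(r, energy, amplitude, r_s):
    """Kinetic energy per unit mass, assuming a conserved stream energy E_i."""
    return energy - nfw_potential(r, amplitude, r_s)

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against targets."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative, dimensionless values (hypothetical, not taken from the paper).
AMPLITUDE, R_S = 1.0, 2.0          # shared halo parameters
radii = [1.0, 2.0, 4.0, 8.0]
stream_energies = [-0.2, -0.1]     # one energy per stream (realization)

streams = [
    [kinetic_energy(r, e, AMPLITUDE, R_S) for r in radii]
    for e in stream_energies
]

# The generating model reproduces the first stream exactly.
exact = [kinetic_energy(r, stream_energies[0], AMPLITUDE, R_S) for r in radii]
score_exact = r_squared(streams[0], exact)

# A candidate with a constant bias fits visibly worse.
biased = [value + 0.05 for value in exact]
score_biased = r_squared(streams[0], biased)

ACCURACY_THRESHOLD = 0.999         # criterion quoted in the Figure 2 caption
print(score_exact > ACCURACY_THRESHOLD, score_biased > ACCURACY_THRESHOLD)
```

Under these assumptions the two streams differ only by a constant energy offset at every radius, so the shape of the potential is constrained by all streams at once.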