Exhaustive Symbolic Regression

Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira

TL;DR

Exhaustive Symbolic Regression (ESR) introduces a deterministic approach to symbolic regression by exhaustively enumerating all analytic expressions up to a predefined complexity from a fixed operator basis and then selecting the best model using the minimum description length (MDL) criterion. This framework guarantees finding the best-fitting function at a given complexity (assuming perfect parameter optimisation) and collapses the Pareto front of accuracy versus simplicity into a single objective, facilitating robust model selection. Demonstrated on cosmological data from cosmic chronometers and the Pantheon+ SN catalog, ESR identifies simple, low-parameter functions that can fit the expansion history comparably to or better than the Friedmann equation in MDL terms, while revealing the limitations of current data to uniquely favor $\Lambda$CDM over alternative histories. The work provides extensive documentation of the method, a computational release, and a discussion of future enhancements, including multivariate extensions and improved duplicate handling.

Abstract

Symbolic Regression (SR) algorithms attempt to learn analytic expressions which fit data accurately and in a highly interpretable manner. Conventional SR suffers from two fundamental issues which we address here. First, these methods search the space stochastically (typically using genetic programming) and hence do not necessarily find the best function. Second, the criteria used to select the equation optimally balancing accuracy with simplicity have been variable and subjective. To address these issues we introduce Exhaustive Symbolic Regression (ESR), which systematically and efficiently considers all possible equations -- made with a given basis set of operators and up to a specified maximum complexity -- and is therefore guaranteed to find the true optimum (if parameters are perfectly optimised) and a complete function ranking subject to these constraints. We implement the minimum description length principle as a rigorous method for combining these preferences into a single objective. To illustrate the power of ESR we apply it to a catalogue of cosmic chronometers and the Pantheon+ sample of supernovae to learn the Hubble rate as a function of redshift, finding $\sim$40 functions (out of 5.2 million trial functions) that fit the data more economically than the Friedmann equation. These low-redshift data therefore do not uniquely prefer the expansion history of the standard model of cosmology. We make our code and full equation sets publicly available.
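
The exhaustive enumeration described above can be illustrated with a minimal sketch. It is not the released ESR code: the operator basis (`NULLARY`, `UNARY`, `BINARY`) and the function names are hypothetical, and it assumes that complexity is the number of nodes in the expression tree. It lists every valid expression with a given node count as a pre-order tuple of operators, and it does not attempt the duplicate removal or parameter fitting that a full search would need.

```python
from functools import lru_cache
from itertools import product

# Hypothetical operator basis grouped by arity (an assumption for illustration).
NULLARY = ("x", "theta")   # input variable and a free parameter
UNARY = ("exp", "log")
BINARY = ("+", "*")


@lru_cache(maxsize=None)
def expressions(n_nodes):
    """Return all pre-order operator tuples with exactly n_nodes nodes."""
    if n_nodes < 1:
        return ()
    if n_nodes == 1:
        return tuple((leaf,) for leaf in NULLARY)
    found = []
    # A unary root uses one node; the remaining nodes form its only child.
    for op in UNARY:
        for child in expressions(n_nodes - 1):
            found.append((op,) + child)
    # A binary root uses one node; split the remainder between two children.
    for op in BINARY:
        for left_size in range(1, n_nodes - 1):
            right_size = n_nodes - 1 - left_size
            for left, right in product(expressions(left_size),
                                       expressions(right_size)):
                found.append((op,) + left + right)
    return tuple(found)


def enumerate_up_to(max_complexity):
    """Map each complexity 1..max_complexity to its list of expressions."""
    return {k: expressions(k) for k in range(1, max_complexity + 1)}
```

Calling `enumerate_up_to(3)` with this toy basis yields 2, 4 and 16 expressions at complexities 1, 2 and 3. Many of these are mathematically equivalent (for example `x + theta` and `theta + x`), which is why identifying unique equations matters in practice.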

Paper Structure (20 sections, 18 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Representations of the expression $\left(\log\left(x\right)\right)^{\theta_0}+ \exp\left( \theta_1 x \right)$ as a tree and as a list of operators. One can generate the tree from the list using the traversal rule outlined in \ref{sec:tree_representation}. The mapping from the list to the function is unique; however, a given equation does not necessarily have a unique tree representation.
  • Figure 2: The number of trial functions containing $p$ parameters at each complexity constructed from the basis functions listed. The solid lines indicate the total number of equations, and the dashed lines are the number of unique equations identified in the ESR search.
  • Figure 3: Pareto front of functions generated with the ESR algorithm, compared to the cosmic chronometer (upper) and Pantheon+ (lower) data. We show the best-fitting functions according to the change in the description length, $L$, and the likelihood, $\mathcal{L}$, relative to the corresponding minima, alongside the best-fit $\Lambda$CDM solutions and a more general Friedmann equation (\ref{eq:Friedmann free w}).
  • Figure 4: Hubble parameter, $H$, as a function of redshift, $z$, learned using the ESR algorithm from cosmic chronometer data (blue points). In the bottom panel we subtract the best-fit $\Lambda$CDM prediction. We plot the top 150 functions up to complexity 10, ranked and coloured by their description length, $L\left(D\right)$, relative to the minimum, MDL.
  • Figure 5: Distance moduli, $\mu$, to Type Ia supernovae in the Pantheon+ catalogue (blue points) as a function of redshift, $z$. We plot the 150 highest ranked functions up to complexity 10 inferred using the ESR algorithm and in the lower panel we plot the residuals relative to the best-fit $\Lambda$CDM prediction.
  • ...and 2 more figures
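
The two representations in Figure 1 can be related with a short sketch. It assumes the list of operators is read in ordinary pre-order (root first, then each child from left to right), which may differ in detail from the traversal rule defined in the paper. The token names (`pow`, `theta0`, `theta1`) and the helpers `build_tree` and `evaluate` are hypothetical.

```python
import math

# Number of arguments each operator takes; tokens not listed are leaves.
ARITY = {"+": 2, "*": 2, "pow": 2, "exp": 1, "log": 1}

FUNCS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "pow": math.pow,
    "exp": math.exp,
    "log": math.log,
}


def build_tree(tokens):
    """Convert a pre-order token list into nested tuples (op, child, ...)."""
    def parse(pos):
        token = tokens[pos]
        pos += 1
        children = []
        for _ in range(ARITY.get(token, 0)):
            child, pos = parse(pos)
            children.append(child)
        return (token, *children), pos

    tree, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return tree


def evaluate(tree, env):
    """Evaluate a tree, looking up leaf values (x, parameters) in env."""
    op, *children = tree
    if not children:
        return env[op]
    return FUNCS[op](*(evaluate(child, env) for child in children))


# (log(x))^theta0 + exp(theta1 * x) written as a list of operators.
tokens = ["+", "pow", "log", "x", "theta0", "exp", "*", "theta1", "x"]
tree = build_tree(tokens)
value = evaluate(tree, {"x": math.e, "theta0": 2.0, "theta1": 0.0})
```

Each token list decodes to exactly one tree here, while the same function could also be written with the two terms of the sum swapped, giving a different list for the same equation.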