Benchmarking symbolic regression constant optimization schemes

L. G. A dos Reis, V. L. P. S. Caminha, T. J. P. Penna

TL;DR

The paper addresses how different constant optimization schemes used during the evolutionary search of SR models affect performance. It benchmarks eight methods across ten univariate SR benchmarks, introducing Tree Edit Distance ($TED$) as a symbolic-accuracy metric and a preprocessing pipeline for fair comparison. Findings show no universal winner; results depend on problem difficulty and expression size, with $TED$ correlating with size and revealing symbolic accuracy not captured by $MSE$ alone. The authors advocate a combined $MSE$-$TED$ analysis for evaluating SR solutions and suggest PSO and BFGS as robust default optimizers, with LM and Nelder-Mead as strong alternatives for harder problems.

Abstract

Symbolic regression is a machine learning technique that has seen many advancements in recent years, especially in genetic programming approaches (GPSR). It has long been known that optimizing constants during the evolutionary search greatly increases GPSR performance. However, different authors approach this task differently, and no consensus exists on which methods perform best. In this work, we evaluate eight different parameter optimization methods, applied during evolutionary search, over ten known benchmark problems, in two different scenarios. We also propose using an under-explored metric called Tree Edit Distance (TED), aiming to identify symbolic accuracy. In conjunction with classical error measures, we develop a combined analysis of model performance in symbolic regression. We then show that different constant optimization methods perform better in certain scenarios and that there is no overall best choice for every problem. Finally, we discuss how common metric decisions may be biased and appear to generate better models in comparison.

Paper Structure

This paper contains 10 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Constants (terminal nodes) in (a) the expression tree are replaced by (b) ephemeral random constants. (c) is a vector that stores the constants separately for easy access and future manipulation.
  • Figure 2: Illustration of the process of TED calculation for an expression tree. The shadow tree represents the expected solution. In the SR context, $T_1$ is the predicted expression, while $T_2$ is the expected solution. (a) The node $\text{abs}$ is removed. (b) $a$ is substituted by $\cos$. (c) A variable is added below the $\cos$ node. Since three operations were required to transform $T_1$ into $T_2$, $TED(T_1, T_2) = 3$.
  • Figure 3: Preprocessing of the output expression. Simplify, NSimplify, and Evalf are methods from Python's SymPy library. Recursive simplify is a custom function shown in code \ref{code:recursive_simplify}. Finally, ERC represents the conversion of numerical constants into abstract constants.
  • Figure 4: Representative MSE distributions for the problems at hand. Outliers were removed for better visualization. The annotations in the top right corner of each figure are an arbitrary classification used to simplify the analysis. They read: (E) easy, (M) medium, and (H) hard.
  • Figure 5: TED distribution for a few representative cases studied. Outliers were removed for better visualization and annotations on the top right corner are an arbitrary classification used to simplify analysis. It reads: (E) easy, (M) medium, and (H) hard.
  • ...and 5 more figures
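
The TED computation described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes ordered trees encoded as nested tuples `(label, child, ...)` and unit cost for every insertion, deletion, and substitution. The two trees below are hypothetical, chosen only to mirror the three edits named in the caption; they are not taken from the paper, and the authors' actual TED implementation may differ (this is a plain memoized recursion, not an optimized algorithm).

```python
from functools import lru_cache

# A tree is a tuple: (label, child_1, child_2, ...); a leaf is (label,).
# A forest is a tuple of trees.


def forest_size(forest):
    """Count all nodes in a forest."""
    return sum(1 + forest_size(tree[1:]) for tree in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1 and not f2:
        return 0
    if not f1:
        return forest_size(f2)  # insert every node of f2
    if not f2:
        return forest_size(f1)  # delete every node of f1

    v, w = f1[-1], f2[-1]  # rightmost trees of each forest

    # Delete the root of v: its children take its place in the forest.
    delete = forest_distance(f1[:-1] + v[1:], f2) + 1
    # Insert the root of w: symmetric to deletion.
    insert = forest_distance(f1, f2[:-1] + w[1:]) + 1
    # Match the two roots (substitution costs 1 if the labels differ),
    # then solve the children and the remaining forests independently.
    match = (
        forest_distance(f1[:-1], f2[:-1])
        + forest_distance(v[1:], w[1:])
        + (v[0] != w[0])
    )
    return min(delete, insert, match)


def ted(t1, t2):
    """Tree edit distance between two single trees."""
    return forest_distance((t1,), (t2,))


# Hypothetical example: predicted abs(x + a) versus expected x + cos(x).
predicted = ("abs", ("+", ("x",), ("a",)))
expected = ("+", ("x",), ("cos", ("x",)))

print(ted(predicted, expected))  # 3: remove abs, substitute a -> cos, add x
```

Because every operation costs one unit, the distance is symmetric and equals zero only for identical trees, so larger expressions have more room to accumulate edits.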