Benchmarking symbolic regression constant optimization schemes
L. G. A. dos Reis, V. L. P. S. Caminha, T. J. P. Penna
TL;DR
The paper examines how the choice of constant optimization scheme, applied during the evolutionary search for symbolic regression (SR) models, affects performance. It benchmarks eight methods on ten univariate SR benchmarks, introducing Tree Edit Distance (TED) as a symbolic-accuracy metric and a preprocessing pipeline for fair comparison. No method wins universally: results depend on problem difficulty and expression size, and TED correlates with size while exposing differences in symbolic accuracy that mean squared error (MSE) alone does not capture. The authors advocate a combined MSE-TED analysis for evaluating SR solutions and suggest PSO and BFGS as robust default optimizers, with LM and Nelder-Mead as strong alternatives for harder problems.
Abstract
Symbolic regression is a machine learning technique that has seen many advancements in recent years, especially in genetic programming approaches (GPSR). It has also been known for many years that optimizing constants during the evolutionary search greatly increases GPSR performance. However, different authors approach this task differently, and no consensus exists regarding which methods perform best. In this work, we evaluate eight different parameter optimization methods, applied during evolutionary search, over ten known benchmark problems, in two different scenarios. We also propose using an under-explored metric called Tree Edit Distance (TED), aiming to identify symbolic accuracy. In conjunction with classical error measures, we develop a combined analysis of model performance in symbolic regression. We then show that different constant optimization methods perform better in certain scenarios and that there is no overall best choice for every problem. Finally, we discuss how common choices of metric may be biased and make models appear better in comparison than they are.
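To make the idea of a tree edit distance between expressions concrete, the following is a minimal sketch and not the authors' implementation. It assumes unit costs for inserting, deleting, and relabeling a node, treats expressions as ordered trees, and represents every numeric constant by the placeholder label "c"; the helper names (leaf, node, forest_distance, tree_edit_distance) and the example expressions are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# A tree is a pair (label, children), where children is a tuple of trees.
# Nested tuples are hashable, so results can be memoized.
def leaf(label):
    return (label, ())

def node(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Only insertions remain: insert the rightmost root of g.
        _, w_children = g[-1]
        return forest_distance(f, g[:-1] + w_children) + 1
    if not g:
        # Only deletions remain: delete the rightmost root of f.
        _, v_children = f[-1]
        return forest_distance(f[:-1] + v_children, g) + 1
    v_label, v_children = f[-1]
    w_label, w_children = g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)

def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))

# Hypothetical example: target x + c versus candidate x + c * x.
target = node("+", leaf("x"), leaf("c"))
candidate = node("+", leaf("x"), node("*", leaf("c"), leaf("x")))
print(tree_edit_distance(target, candidate))  # 2: insert "*" and one "x"
```

Under these assumptions, a candidate can reach a low error on sampled data while still sitting several edits away from the target expression, which is the kind of discrepancy a combined error and TED analysis is meant to reveal.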
