The Inefficiency of Genetic Programming for Symbolic Regression -- Extended Version
Gabriel Kronberger, Fabricio Olivetti de Franca, Harry Desmond, Deaglan J. Bartlett, Lukas Kammerer
TL;DR
This study quantifies the efficiency of genetic programming (GP) for symbolic regression (SR) in a finite, exhaustively enumerable search space. It compares GP with parameter optimization against the expression set enumerated by Exhaustive Symbolic Regression (ESR), which uses equality saturation to collapse semantically equivalent expressions into canonical forms. On the Nikuradse flow and Radial Acceleration Relation datasets, the authors show that GP explores only a small portion of the semantically unique expression space and frequently revisits semantically identical forms, resulting in a lower success probability than an idealized random search within the same space. The work highlights the role of semantic deduplication and exhaustive enumeration in understanding SR algorithm performance, and suggests that GP efficiency could be significantly improved by preventing redundant evaluations and leveraging canonical representations. Overall, the findings question GP's practicality for SR in constrained, short-expression regimes and point to equality-saturation-based approaches as a promising avenue for more efficient symbolic regression search strategies.
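To make the notion of revisiting semantically identical forms concrete, the following minimal Python sketch counts how many evaluations land on an expression whose canonical form was already seen. It is not equality saturation: the `canonical` function is a hypothetical stand-in that only sorts the operands of commutative operators. The tuple encoding of expressions, the function names, and the sample list are assumptions made for illustration, not taken from the study.

```python
# Expressions are nested tuples such as ("+", a, b), or leaf strings such as "x" or "c".
# ASSUMPTION: this encoding and the rewrite rule below are illustrative only.
COMMUTATIVE = {"+", "*"}


def canonical(expr):
    """Toy canonical form: recursively sort operands of commutative operators."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any fixed total order works
    return (op, *args)


def revisit_rate(visited):
    """Fraction of evaluations whose canonical form had already been seen."""
    seen = set()
    repeats = 0
    for expr in visited:
        key = canonical(expr)
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(visited)


# Hypothetical sequence of expressions evaluated by a search run.
visited = [
    ("+", "x", "c"),
    ("+", "c", "x"),              # same as the first, operands swapped
    ("*", ("+", "x", "c"), "x"),
    ("*", "x", ("+", "c", "x")),  # same as the third
    ("/", "x", "c"),
    ("/", "c", "x"),              # division is not commutative: a new expression
]

print(revisit_rate(visited))
```

In this toy run, two of the six evaluations are spent on forms that were already visited. A real equality-saturation engine applies a much larger rule set, so it would merge many more syntactic variants than this stand-in does.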
Abstract
We analyse the search behaviour of genetic programming for symbolic regression in practically relevant but limited settings, allowing exhaustive enumeration of all solutions. This enables us to quantify the success probability of finding the best possible expressions, and to compare the search efficiency of genetic programming to random search in the space of semantically unique expressions. This analysis is made possible by improved algorithms for equality saturation, which we use to improve the Exhaustive Symbolic Regression algorithm; this produces the set of semantically unique expression structures, orders of magnitude smaller than the full symbolic regression search space. We compare the efficiency of genetic programming with that of random search in the set of unique expressions. For our experiments we use two real-world datasets where symbolic regression has been used to produce well-fitting univariate expressions: the Nikuradse dataset of flow in rough pipes and the Radial Acceleration Relation of galaxy dynamics. The results show that genetic programming in such limited settings explores only a small fraction of all unique expressions, and repeatedly evaluates expressions that are congruent to ones it has already visited.
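As an illustration of the random-search baseline, the sketch below computes a success probability under one simple reading of it. That reading is an assumption here, not the paper's stated definition: expressions are drawn uniformly without replacement from the set of unique expressions, and a run succeeds if at least one draw falls in a designated set of best expressions. The function name, parameters, and numbers are hypothetical.

```python
def random_search_success(n_unique, n_best, n_evals):
    """Probability that n_evals uniform draws without replacement from
    n_unique expressions include at least one of the n_best target expressions.

    ASSUMPTION: this models an idealized random search; it is not taken
    from the paper.
    """
    if n_best <= 0 or n_evals <= 0:
        return 0.0
    if n_evals > n_unique - n_best:
        return 1.0  # more draws than non-target expressions: a hit is certain
    miss = 1.0
    for i in range(n_evals):
        # Probability that draw i also misses, given all earlier draws missed.
        miss *= (n_unique - n_best - i) / (n_unique - i)
    return 1.0 - miss


# Hypothetical sizes, chosen only to show the call.
print(random_search_success(n_unique=10, n_best=1, n_evals=5))
```

With 10 unique expressions, one of which is best, 5 evaluations succeed with probability of about 0.5. The loop form avoids computing large binomial coefficients, which matters when the set of unique expressions is large.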
