Improving Genetic Programming for Symbolic Regression with Equality Graphs
Fabricio Olivetti de Franca, Gabriel Kronberger
TL;DR
The paper tackles redundancy in genetic programming (GP) for symbolic regression (SR), where the search repeatedly revisits expressions in their original or equivalent forms. It introduces eggp, a history-aware GP that uses equality saturation and an e-graph to store expressions and their equivalents, guiding crossover and mutation toward unvisited forms. Empirical results on SRBench and real-world datasets show that eggp variants achieve competitive or superior metrics such as $R^2$ and AUC while producing smaller models, with runtimes between those of Operon and PySR. The work demonstrates the practical value of history-aware search in SR and points to future extensions of e-graph-guided operators and scalability considerations.
Abstract
The search for symbolic regression models with genetic programming (GP) tends to revisit expressions in their original or equivalent forms. Repeatedly evaluating equivalent expressions is inefficient, as it does not immediately lead to better solutions. However, evolutionary algorithms require diversity and should allow the accumulation of inactive building blocks that can play an important role at a later point. The equality graph (e-graph) is a data structure that compactly stores expressions and their equivalent forms, which allows efficient verification of whether an expression has already been visited in any of its stored equivalent forms. We exploit the e-graph to adapt the subtree operators to reduce the chances of revisiting expressions. Our adaptation, called eggp, stores every visited expression in the e-graph, allowing us to filter out from the available selection of subtrees all the combinations that would create already visited expressions. Results show that, for small expressions, this approach improves the performance of a simple GP algorithm enough to compete with PySR and Operon without increasing computational cost. As a highlight, eggp reliably delivered short yet accurate models for a selected set of benchmarks from SRBench and a set of real-world datasets.
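The filtering idea described above can be illustrated with a minimal Python sketch. It is not the eggp implementation and does not build an e-graph or run equality saturation: as a simplifying assumption, equivalence is approximated by a canonical form that only normalizes operand order for `+` and `*`. The names `canonical`, `VisitedStore`, `replace_at`, and `novel_subtrees` are hypothetical and introduced here for illustration only.

```python
# Expressions are leaves (variable names or constants) or tuples (op, arg1, arg2).
COMMUTATIVE = {"+", "*"}


def canonical(expr):
    """Return a form shared by expressions that differ only in the
    operand order of commutative operators (a stand-in for an e-class)."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any deterministic order works
    return (op, *args)


class VisitedStore:
    """Remembers visited expressions up to the equivalence above."""

    def __init__(self):
        self._seen = set()

    def add(self, expr):
        self._seen.add(canonical(expr))

    def __contains__(self, expr):
        return canonical(expr) in self._seen


def replace_at(expr, path, subtree):
    """Return a copy of expr with the node at path replaced by subtree.
    path is a tuple of 0-based argument indices from the root."""
    if not path:
        return subtree
    op, *args = expr
    args[path[0]] = replace_at(args[path[0]], path[1:], subtree)
    return (op, *args)


def novel_subtrees(expr, path, candidates, store):
    """Keep only the candidate subtrees whose insertion at path
    yields an expression that has not been visited yet."""
    return [s for s in candidates if replace_at(expr, path, s) not in store]


store = VisitedStore()
store.add(("+", "x", "y"))
store.add(("+", ("*", "x", "y"), "x"))

parent = ("+", "x", ("*", "x", "x"))
candidates = ["y", ("*", "y", "x"), "x"]
print(novel_subtrees(parent, (1,), candidates, store))
```

In this toy run, the first two candidates are rejected because they would recreate stored expressions up to operand order, so only the third remains available to the subtree operator. An actual e-graph covers far more equivalences than commutativity and shares sub-expressions between stored forms, which this sketch does not attempt.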
