Improving Genetic Programming for Symbolic Regression with Equality Graphs

Fabricio Olivetti de Franca, Gabriel Kronberger

TL;DR

The paper tackles a source of redundancy in genetic programming (GP) for symbolic regression (SR): the search repeatedly revisits expressions in their original or equivalent forms. It introduces eggp, a history-aware GP that uses equality saturation and an e-graph to store visited expressions and their equivalents, and uses that record to steer crossover and mutation toward unvisited forms. Empirical results on SRBench and real-world datasets show that eggp variants achieve competitive or superior metrics such as $R^2$ and AUC while producing smaller models, with runtimes between those of Operon and PySR. The work demonstrates the practical value of history-aware search in SR and points to future extensions of e-graph-guided operators and scalability considerations.

Abstract

The search for symbolic regression models with genetic programming (GP) tends to revisit expressions in their original or equivalent forms. Repeatedly evaluating equivalent expressions is inefficient, as it does not immediately lead to better solutions. However, evolutionary algorithms require diversity and should allow the accumulation of inactive building blocks that can play an important role at a later point. The equality graph (e-graph) is a data structure capable of compactly storing expressions and their equivalent forms, allowing an efficient check of whether an expression has been visited in any of its stored equivalent forms. We exploit the e-graph to adapt the subtree operators to reduce the chances of revisiting expressions. Our adaptation, called eggp, stores every visited expression in the e-graph, allowing us to filter out from the available selection of subtrees all the combinations that would create already visited expressions. Results show that, for small expressions, this approach improves the performance of a simple GP algorithm enough to compete with PySR and Operon without increasing computational cost. As a highlight, eggp was capable of reliably delivering models that are both short and accurate for a selected set of benchmarks from SRBench and a set of real-world datasets.
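
The filtering idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation and it does not build an e-graph or run equality saturation: as a stand-in, it only treats expressions as equivalent when they differ by the operand order of `+` and `*`. The names `canonical`, `History`, `replace_at`, `subtrees`, and `unvisited_donors` are hypothetical and introduced here for illustration only.

```python
# Assumption: expressions are nested tuples (op, left, right) with string or
# float leaves. Equivalence is approximated by sorting commutative operands,
# which is far weaker than what an e-graph with rewrite rules can detect.
COMMUTATIVE = {"+", "*"}


def canonical(expr):
    """Return a canonical form so commutative variants compare equal."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)  # repr avoids comparing str with float
    return (op, *args)


class History:
    """Record of visited expressions, keyed by canonical form."""

    def __init__(self):
        self._seen = set()

    def add(self, expr):
        self._seen.add(canonical(expr))

    def visited(self, expr):
        return canonical(expr) in self._seen


def replace_at(expr, path, new):
    """Replace the subtree reached by following tuple indices in `path`."""
    if not path:
        return new
    parts = list(expr)
    parts[path[0]] = replace_at(parts[path[0]], path[1:], new)
    return tuple(parts)


def subtrees(expr):
    """Yield every subtree of `expr`, root first."""
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)


def unvisited_donors(parent, path, donor, history):
    """Keep only donor subtrees whose insertion yields an unvisited expression."""
    return [s for s in subtrees(donor)
            if not history.visited(replace_at(parent, path, s))]


history = History()
parent = ("+", "x", ("*", 2.0, "x"))   # x + 2x
history.add(parent)
donor = ("*", "x", 2.0)                # x * 2, same as 2x up to operand order

# Replace the right child of `parent` with each subtree of `donor`.
print(unvisited_donors(parent, (2,), donor, history))  # -> ['x', 2.0]
```

In this example the whole donor tree is rejected because inserting it would recreate `x + 2x` in a different operand order, while the two leaves produce expressions that are not yet in the history.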

Paper Structure

This paper contains 10 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) Illustrative example of an e-graph (the left box shows the expressions evaluated at each e-class) and (b) the same e-graph after inserting the expression $x + 2x$.
  • Figure 2: Examples using the e-graph in Fig. 1(b) of (a) recombination between two expressions: after choosing the recombination point marked in bold in the first tree, only two points in the second tree (marked in bold in the second expression) would generate new expressions; picking one of these points yields the new solution illustrated in the tree to the right; (b) mutation: after choosing the mutation point, a new subtree is generated. If the new expression is already contained in the e-graph, the root of the subtree is replaced by a random non-terminal that creates an unvisited expression.
  • Figure 3: Performance plots for the SRBench datasets. Each plot shows the probability of returning an $R^2$ greater than or equal to $x$ on a random run of each algorithm.
  • Figure 4: Performance plots for the real-world datasets. Each plot shows the probability of returning an $R^2$ greater than or equal to $x$ on a random run of each algorithm.
  • Figure 5: Relative avg. runtime of each algorithm using Operon as a baseline.
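
The curves described for Figures 3 and 4 give, for each threshold $x$, the probability that a random run returns $R^2 \geq x$. A minimal sketch of how such an empirical curve could be computed from per-run scores follows; the function name `performance_curve` is hypothetical, and the scores are made-up values for illustration, not results from the paper.

```python
def performance_curve(r2_scores, thresholds):
    """Empirical P(R^2 >= x): fraction of runs scoring at least each threshold."""
    if not r2_scores:
        raise ValueError("need at least one run")
    n = len(r2_scores)
    return [sum(score >= x for score in r2_scores) / n for x in thresholds]


# Illustrative (invented) R^2 values from four independent runs.
runs = [0.5, 0.8, 0.9, 0.95]
curve = performance_curve(runs, [0.0, 0.8, 0.9, 1.0])
print(curve)  # -> [1.0, 0.75, 0.5, 0.0]
```

A curve that stays close to 1 for larger values of $x$ indicates an algorithm that returns accurate models more reliably across runs.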