Equality Graph Assisted Symbolic Regression

Fabricio Olivetti de Franca, Gabriel Kronberger

TL;DR

This paper tackles the inefficiency of genetic-programming–driven symbolic regression caused by redundant, equivalent expressions. It introduces SymRegg, a non-population search that leverages equality graphs and equality saturation to store visited expressions and generate unvisited equivalents, using a minimal set of hyperparameters. Empirical results on four real-world datasets show SymRegg approaches ideal efficiency and often surpasses baseline GP methods in evaluation count, while remaining competitive in accuracy. The approach offers a scalable, interpretable alternative to population-based SR with practical benefits for real-world equation discovery.

Abstract

In Symbolic Regression (SR), Genetic Programming (GP) is a popular search algorithm that delivers state-of-the-art results in terms of accuracy. Its success relies on the concept of neutrality, which induces large plateaus that the search can safely traverse to reach more promising regions. Navigating these plateaus, while necessary, requires the computation of redundant expressions, up to 60% of the total number of evaluations, as noted in a recent study. The equality graph (e-graph) structure can compactly store and group equivalent expressions, which lets us verify whether a given expression and its variations were already visited by the search and thus avoid unnecessary computation. We propose a new search algorithm for symbolic regression called SymRegg that revolves around the e-graph structure and follows simple steps: perturb solutions sampled from a selection of expressions stored in the e-graph and, if the perturbation generates an unvisited expression, insert it into the e-graph and generate its equivalent forms. We show that SymRegg is capable of improving the efficiency of the search, maintaining consistently accurate results across different datasets while requiring only a minimal set of hyperparameters.
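The steps named in the abstract (sample, perturb, check whether the result was visited, insert) can be sketched as a small loop. This is only an illustrative toy under stated assumptions, not the authors' implementation: the operator set, the terminals, the uniform sampling, and the helper names `canon`, `perturb`, and `search` are all hypothetical, and a set of canonical forms that only knows commutativity of `+` and `*` stands in for the e-graph and equality saturation. No model fitting is performed; the `evaluations` counter only marks where an evaluation would happen.

```python
import random

OPS = ["+", "-", "*"]      # hypothetical binary operator set
LEAVES = ["x", "1", "2"]   # hypothetical terminals


def canon(e):
    """Stand-in for an e-graph lookup: sort operands of commutative operators."""
    if isinstance(e, str):
        return e
    op, a, b = e
    a, b = canon(a), canon(b)
    if op in ("+", "*") and repr(b) < repr(a):
        a, b = b, a
    return (op, a, b)


def perturb(e, rng):
    """Replace one randomly chosen position with a random leaf or small subtree."""
    if isinstance(e, str) or rng.random() < 0.3:
        if rng.random() < 0.5:
            return rng.choice(LEAVES)
        return (rng.choice(OPS), rng.choice(LEAVES), rng.choice(LEAVES))
    op, a, b = e
    if rng.random() < 0.5:
        return (op, perturb(a, rng), b)
    return (op, a, perturb(b, rng))


def search(steps, seed=0):
    """Toy non-population search: only unvisited expressions count as evaluations."""
    rng = random.Random(seed)
    visited = {canon("x")}              # canonical forms seen so far
    evaluations = skipped = 0
    for _ in range(steps):
        parent = rng.choice(sorted(visited, key=repr))  # uniform sampling (assumption)
        child = canon(perturb(parent, rng))
        if child in visited:            # equivalent form already seen: no evaluation
            skipped += 1
            continue
        visited.add(child)              # a real search would fit and score it here
        evaluations += 1
    return visited, evaluations, skipped


visited, evaluations, skipped = search(200)
```

Even with this crude notion of equivalence, `skipped` counts candidates that would otherwise have been evaluated a second time; an e-graph generalizes the idea to every equivalence its rewrite rules can derive.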

Paper Structure

This paper contains 7 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The expression $2x(x+x)$ represented as (a) a tree, (b) a directed acyclic graph, and (c) an e-graph. The e-graph adds dashed lines around the nodes that correspond to equivalent expressions.
  • Figure 2: (a) Illustrative example of an e-graph. Solid boxes represent e-nodes and dashed lines represent e-classes (id numbers in the lower right). Expressions extracted by following any path from a given e-class are equivalent, for example $2x$ and $x+x$, both extracted from e-class $4$. (b) When inserting the expression $x + 2x$, the already present e-classes are reused (see e-class $8$).
  • Figure 3: Example of the recombination and perturbation operators. In this e-graph, the green e-classes represent the roots of already evaluated expressions. The example assumes that the expression $x+\textcolor{red}{\mathbf{\sqrt{x}}}$ is sampled and that the red highlight marks the part of the expression to be replaced. The recombination operator samples a second expression, $x+2x$, which contains the set of subtrees $\{x+2x,x,2x,2\}$. From this set, the subtree $2x$ would create an already visited expression, so it is discarded before the replacement is selected. For the perturbation, suppose that replacing the highlighted part with a random subtree creates the expression $x+\textcolor{red}{\mathbf{2x}}$, which already exists in the e-graph. The algorithm then replaces the multiplication operator with any other operator of the same arity that would generate an unvisited expression, such as $+,-,\div{},\textasciicircum$.
  • Figure 4: Probability of achieving a solution with MSE value below a given threshold (top) after a number of evaluations.
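
The e-class reuse described in the captions of Figures 1 and 2 can be illustrated with a very small e-graph. The sketch below is a minimal assumption-laden toy, not the data structure used in the paper: the `EGraph` class and its methods are hypothetical names, the equivalence $2x = x+x$ is asserted by hand instead of being derived by rewrite rules, and the congruence-restoring rebuild step that full e-graph implementations perform after merging classes is omitted. The e-class ids it produces do not match the ids shown in Figure 2.

```python
class EGraph:
    """Toy e-graph: union-find over e-class ids plus a hashcons of e-nodes."""

    def __init__(self):
        self.parent = []     # union-find parent pointer for each e-class id
        self.hashcons = {}   # e-node (op, child class ids...) -> e-class id

    def find(self, a):
        """Return the canonical id of the e-class containing a."""
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def add(self, op, *children):
        """Insert an e-node, reusing the existing e-class if it is already stored."""
        node = (op, *(self.find(c) for c in children))
        if node in self.hashcons:
            return self.find(self.hashcons[node])
        cid = len(self.parent)
        self.parent.append(cid)
        self.hashcons[node] = cid
        return cid

    def union(self, a, b):
        """Merge two e-classes, declaring their expressions equivalent."""
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
        return a


g = EGraph()
x = g.add("x")
two = g.add("2")
two_x = g.add("*", two, x)        # 2x
x_plus_x = g.add("+", x, x)       # x + x
g.union(two_x, x_plus_x)          # assert 2x = x + x by hand

e1 = g.add("+", x, two_x)         # x + 2x
e2 = g.add("+", x, x_plus_x)      # x + (x + x) maps to the same e-node
```

Because `2x` and `x + x` share an e-class, inserting `x + (x + x)` after `x + 2x` creates no new e-node: `e1` and `e2` are the same e-class, which is the kind of lookup that lets a search recognize an already visited expression.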