EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph

Nan Jiang, Ziyi Wang, Yexiang Xue

TL;DR

EGG-SR tackles the combinatorially large search space of symbolic regression, an NP-hard problem, by encoding symbolic equivalence with equality graphs (e-graphs) and integrating this structure into diverse learning paradigms. The framework yields Egg-MCTS, Egg-DRL, and Egg-LLM, which respectively prune redundant search paths, stabilize training, and enrich feedback with equivalent expressions. Theoretical results show a tighter regret bound for Egg-MCTS via a reduced effective branching factor $\kappa_\infty \le \kappa$ and a lower-variance, unbiased gradient estimator for Egg-DRL. Empirically, EGG-SR achieves lower normalized mean squared error across multiple benchmarks and demonstrates favorable space and time efficiency, highlighting its potential to accelerate discovery of governing equations in science and engineering.

Abstract

Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expressions renders the task computationally challenging. A promising yet underexplored direction for reducing the effective search space and accelerating training lies in symbolic equivalence: many expressions, although syntactically different, define the same function, for example $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates equality graphs (e-graphs) into diverse symbolic regression algorithms, including Monte Carlo Tree Search (MCTS), deep reinforcement learning (DRL), and large language models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module, enabling more efficient learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalence classes in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Under mild assumptions, we show that embedding e-graphs tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances multiple baselines across challenging benchmarks, discovering equations with lower normalized mean squared error than state-of-the-art methods. Code implementation is available at: https://www.github.com/jiangnanhugo/egg-sr.
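
As a quick illustration of the equivalence example in the abstract, the following minimal sketch checks numerically that the three logarithmic forms agree. It is not taken from the EGG-SR code base; the function names `f1`, `f2`, `f3` and the sampling range are assumptions made for illustration, and the inputs are restricted to positive values, where all three forms are defined.

```python
import math
import random

# Three syntactically different forms of the same function (for x1, x2 > 0).
def f1(x1, x2):
    return math.log(x1**2 * x2**3)

def f2(x1, x2):
    return math.log(x1**2) + math.log(x2**3)

def f3(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)

# Sample positive inputs only (an assumption of this sketch).
rng = random.Random(0)
points = [(rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)) for _ in range(1000)]

# Largest pairwise disagreement over the sample; only float rounding remains.
max_gap = max(
    max(abs(f1(*p) - f2(*p)), abs(f1(*p) - f3(*p)))
    for p in points
)
print(f"max disagreement: {max_gap:.2e}")
```

A search procedure that treats these three strings as distinct candidates would fit and score the same function three times, which is the redundancy the abstract describes.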

Paper Structure

This paper contains 35 sections, 7 theorems, 28 equations, 21 figures, 6 tables.

Key Result

Theorem 3.1

Consider the MCTS learning framework augmented with Egg. As defined in Definitions 1 and 3, let $T$ denote the total number of learning iterations, $\gamma \in (0,1)$ the discount factor of the corresponding Markov decision process, $\kappa$ be the near-optimal branching factor

Figures (21)

  • Figure 1: Applying the rewrite rule $\log(a\times b) \leadsto \log(a) + \log(b)$ to an e-graph representing expression $\log(x_1^3x_2^2)$. (a) The initialized e-graph consists of e-classes (dashed boxes), each containing equivalent e-nodes (solid boxes). Edges connect e-nodes to their child e-classes. (b) The matching step identifies the e-nodes that match the $\mathtt{LHS}$ of the rule. (c) The substitution step adds new e-classes and edges corresponding to the $\mathtt{RHS}$ to the e-graph. (d) The merging step consolidates equivalent e-classes. The final e-graph in (d) compactly represents two equivalent expressions.
  • Figure 2: Execution pipeline of our Egg-MCTS. (a) Starting at the root of the tree, the algorithm selects the child with the highest UCT score (see the UCT equation in the paper) until reaching a leaf. (b) The selected leaf is expanded by applying all applicable production rules, producing new child nodes. (c) For each child, several simulations are run to complete the expression template by sampling additional rules. The resulting expressions are fitted to the data to estimate their coefficients. (d) Rewards and visit counts from evaluated children are back-propagated up the tree. Updates are applied to the selected path and to its equivalent paths (highlighted in two colors), enabled by our Egg module.
  • Figure 3: Results on the "sincos(3,2,2)" dataset. (Left) Search tree size over learning iterations for MCTS and Egg-MCTS. (Right) Empirical mean and standard deviation of the estimated quantity for DRL and Egg-DRL.
  • Figure 4: Egg uses less memory than the array-based approach for two settings: (Left) $\log\left(x_1 \times \cdots \times x_n\right)$ rewritten using $\log(a b) \leadsto \log a + \log b$. (Right) $\sin\left(x_1 + \cdots + x_n\right)$ rewritten using $\sin(a + b) \leadsto \sin a \cos b + \cos a \sin b$.
  • Figure 5: The Egg module is time-efficient and introduces negligible overhead compared with the four main computations in DRL. Left: LSTM. Right: Decoder-only Transformer.
  • ...and 16 more figures
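
The match, substitute, and merge steps described in the Figure 1 caption can be sketched with a toy e-graph. This is a minimal illustration under stated assumptions, not the paper's Egg module: the class `EGraph` and the helpers `apply_log_product_rule` and `extract_all` are hypothetical names, e-classes are plain integer ids managed by a simple union-find, and the congruence-closure rebuilding that full e-graph implementations perform after merging is omitted.

```python
from itertools import count, product

class EGraph:
    """Toy e-graph: e-classes are integer ids, e-nodes are (op, child_class_ids)."""

    def __init__(self):
        self.parent = {}    # union-find parent pointer for each e-class id
        self.nodes = {}     # canonical e-class id -> set of e-nodes
        self.hashcons = {}  # e-node -> e-class id it was first added to
        self._ids = count()

    def find(self, c):
        while self.parent[c] != c:
            c = self.parent[c]
        return c

    def add(self, op, *children):
        node = (op, tuple(self.find(c) for c in children))
        if node in self.hashcons:          # reuse an existing e-node
            return self.find(self.hashcons[node])
        c = next(self._ids)
        self.parent[c] = c
        self.nodes[c] = {node}
        self.hashcons[node] = c
        return c

    def merge(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.nodes[a] |= self.nodes.pop(b)
        return a

def apply_log_product_rule(eg):
    """Apply log(a * b) ~> log(a) + log(b) once; return the number of matches."""
    # (b) Matching: collect log e-nodes whose child e-class contains a product.
    matches = []
    for cls, enodes in eg.nodes.items():
        for op, kids in enodes:
            if op == "log":
                for op2, kids2 in eg.nodes[eg.find(kids[0])]:
                    if op2 == "*":
                        matches.append((cls, kids2[0], kids2[1]))
    for cls, a, b in matches:
        # (c) Substitution: add the right-hand side to the e-graph.
        rhs = eg.add("+", eg.add("log", a), eg.add("log", b))
        # (d) Merging: the new e-class is equivalent to the matched one.
        eg.merge(cls, rhs)
    return len(matches)

def extract_all(eg, cls):
    """Enumerate all expressions in an e-class (acyclic e-graphs only)."""
    out = set()
    for op, kids in eg.nodes[eg.find(cls)]:
        if not kids:
            out.add(op)
        else:
            for args in product(*(extract_all(eg, k) for k in kids)):
                out.add(f"{op}({', '.join(args)})")
    return out

# (a) Initialize the e-graph with log(x1^3 * x2^2), written in prefix form.
eg = EGraph()
x1, x2 = eg.add("x1"), eg.add("x2")
prod = eg.add("*", eg.add("pow", x1, eg.add("3")), eg.add("pow", x2, eg.add("2")))
root = eg.add("log", prod)

num_matches = apply_log_product_rule(eg)
exprs = extract_all(eg, root)
for e in sorted(exprs):
    print(e)
```

After one rule application the root e-class holds both the original `log` e-node and the new `+` e-node, so the two equivalent expressions share every sub-expression instead of being stored as two separate trees.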

Theorems & Definitions (17)

  • Theorem 3.1
  • proof : Proof Sketch
  • Theorem 3.2
  • proof : Proof Sketch
  • Definition 1: Difficulty measure
  • Theorem B.1: Regret bound of MCTS (Munos, 2014, Chapter 5)
  • Definition 2: Upper and Lower bounds of the Value function
  • Definition 3: Finer Difficulty Measure
  • Theorem B.2: Regret Bound of Egg-MCTS
  • proof
  • ...and 7 more