EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
Nan Jiang, Ziyi Wang, Yexiang Xue
TL;DR
EGG-SR tackles the vast search space of symbolic regression, an NP-hard problem, by encoding symbolic equivalence with equality graphs (e-graphs) and integrating this structure into diverse learning paradigms. The framework yields EGG-MCTS, EGG-DRL, and EGG-LLM, which respectively prune redundant search paths, stabilize training, and enrich feedback with equivalent expressions. Theoretical results show a tighter regret bound for EGG-MCTS via a reduced effective branching factor $\nabla_{\infty} \le \kappa$, and an unbiased, lower-variance gradient estimator for EGG-DRL. Empirically, EGG-SR lowers normalized mean squared error across multiple benchmarks and demonstrates favorable space and time efficiency, highlighting its potential to accelerate the discovery of governing equations in science and engineering.
Abstract
Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, an important task in AI-driven scientific discovery. Yet the exponential growth of the expression search space renders the task computationally challenging. A promising yet underexplored direction for reducing the effective search space and accelerating training lies in symbolic equivalence: many expressions, although syntactically different, define the same function. For example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$ are equivalent. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates equality graphs (e-graphs) into diverse symbolic regression algorithms, including Monte Carlo Tree Search (MCTS), deep reinforcement learning (DRL), and large language models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module, enabling more efficient learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalence classes in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Under mild assumptions, we show that embedding e-graphs tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances multiple baselines across challenging benchmarks, discovering equations with lower normalized mean squared error than state-of-the-art methods. The code is available at: https://www.github.com/jiangnanhugo/egg-sr.
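To make the notion of symbolic equivalence concrete, the following minimal sketch checks numerically that the three logarithmic forms in the abstract agree on positive inputs, and groups a small candidate set into equivalence classes. It is an illustration only, not the EGG module: an e-graph derives equivalences from rewrite rules rather than from random sampling, and the names `numerically_equivalent`, `group_equivalent`, and the extra candidate `f4` are hypothetical, introduced here for the example.

```python
import math
import random

# The three syntactically different forms from the abstract.
# They agree for x1 > 0 and x2 > 0, where every logarithm is defined.
def f1(x1, x2):
    return math.log(x1**2 * x2**3)

def f2(x1, x2):
    return math.log(x1**2) + math.log(x2**3)

def f3(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)

# A hypothetical non-equivalent candidate, added for contrast.
def f4(x1, x2):
    return math.log(x1) + math.log(x2)

def numerically_equivalent(f, g, trials=200, seed=0, tol=1e-9):
    """Sampling-based check: True if f and g agree on random positive inputs.

    This is a stand-in for rewrite-based equivalence; it can only give
    evidence of equivalence, not a proof.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)
        if not math.isclose(f(x1, x2), g(x1, x2), rel_tol=tol, abs_tol=tol):
            return False
    return True

def group_equivalent(candidates):
    """Greedily partition candidates into classes of mutually agreeing functions."""
    classes = []
    for f in candidates:
        for cls in classes:
            # Compare against the class representative (its first member).
            if numerically_equivalent(cls[0], f):
                cls.append(f)
                break
        else:
            classes.append([f])
    return classes

if __name__ == "__main__":
    classes = group_equivalent([f1, f2, f3, f4])
    for i, cls in enumerate(classes):
        print(i, [f.__name__ for f in cls])
```

A search procedure that treats each class as a single candidate evaluates one representative instead of every syntactic variant, which is the redundancy the abstract describes.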
