Neural Structure Embedding for Symbolic Regression via Continuous Structure Search and Coefficient Optimization

Fateme Memar, Tao Zhe, Dongjie Wang

Abstract

Symbolic regression aims to discover human-interpretable equations that explain observational data. However, existing approaches rely heavily on discrete structure search (e.g., genetic programming), which often leads to high computational cost, unstable performance, and limited scalability to large equation spaces. To address these challenges, we propose SRCO, a unified embedding-driven framework for symbolic regression that transforms symbolic structures into a continuous, optimizable representation space. The framework consists of three key components: (1) structure embedding: we first generate a large pool of exploratory equations using traditional symbolic regression algorithms and train a Transformer model to compress symbolic structures into a continuous embedding space; (2) continuous structure search: the embedding space enables efficient exploration using gradient-based or sampling-based optimization, significantly reducing the cost of navigating the combinatorial structure space; and (3) coefficient optimization: for each discovered structure, we treat symbolic coefficients as learnable parameters and apply gradient optimization to obtain accurate numerical values. Experiments on synthetic and real-world datasets show that our approach consistently outperforms state-of-the-art methods in equation accuracy, robustness, and search efficiency. This work introduces a new paradigm for symbolic regression by bridging symbolic equation discovery with continuous embedding learning and optimization.

Paper Structure (32 sections, 12 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Framework Overview of SRCO. Structure Embedding: A GP-based SR system generates diverse candidate equations, which are converted into postfix sequences with abstracted coefficients (COF) to train a Transformer-based structural prior; Continuous Structure Search: The learned prior guides constrained sampling in postfix space, followed by syntactic, semantic, and complexity filtering to obtain valid symbolic templates; Coefficient Optimization: For each selected template, COF tokens are instantiated as learnable parameters and optimized via gradient-based regression to produce the final symbolic equation.
  • Figure 2: Ablation of coefficient optimization on Feynman-bonus.1. We compare SRCO’s gradient-based coefficient fitting (model+) to stochastic hill-climbing (random search; model-) while keeping the template, train/test split, and optimization budget fixed. Bars report held-out test-set performance (higher is better for $R^2$ and $\rho$, lower is better for MSE).
  • Figure 3: Average per-equation evaluation time on the test split (seconds; lower is better), averaged over six settings (2 benchmarks $\times$ 3 tiers: Feynman--synthetic/real-world $\times$ easy/medium/hard). SRCO achieves the fastest evaluation (0.00649 s), essentially tied with EFS (0.00651 s) and ahead of DSO (2.6$\times$ slower), FFX (6.1$\times$ slower), and gplearn (38.5$\times$ slower), while maintaining strong accuracy (Tables \ref{tab:feynman-synth}--\ref{tab:feynman-real}).
  • Figure 4: Pearson correlation $\rho$ as a function of max_term. Accuracy improves monotonically and saturates around 18--22 terms.
  • Figure 5: $R^2$ as a function of max_term. Results mirror those for Pearson correlation $\rho$, with diminishing returns after 18--22 terms.
  • ...and 2 more figures
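
As a rough illustration of the coefficient optimization step described in the Figure 1 caption, the sketch below evaluates a postfix template containing COF placeholders and fits those placeholders by gradient descent on the mean squared error. It is a minimal sketch under stated assumptions, not the authors' implementation: the operator tables, the function names (eval_postfix, mse, fit_coefficients), the x0-style variable tokens, the central finite-difference gradient (standing in for whatever automatic differentiation the paper uses), and the toy data are all hypothetical.

```python
import math

# Hypothetical operator vocabulary; the paper's actual token set is not given here.
BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
UNARY = {"sin": math.sin, "cos": math.cos}


def eval_postfix(tokens, x, coefs):
    """Evaluate a postfix template at one input point x (a sequence of floats).

    Each "COF" token consumes the next value from coefs, in order of appearance.
    """
    stack = []
    next_coef = 0
    for tok in tokens:
        if tok == "COF":
            stack.append(coefs[next_coef])
            next_coef += 1
        elif tok in BINARY:
            b = stack.pop()
            a = stack.pop()
            stack.append(BINARY[tok](a, b))
        elif tok in UNARY:
            stack.append(UNARY[tok](stack.pop()))
        elif tok.startswith("x"):
            # Assumed naming scheme: "x0", "x1", ... index into the input vector.
            stack.append(x[int(tok[1:])])
        else:
            raise ValueError(f"unknown token: {tok}")
    if len(stack) != 1:
        raise ValueError("malformed postfix sequence")
    return stack[0]


def mse(tokens, X, y, coefs):
    """Mean squared error of the template over the dataset (X, y)."""
    errors = [(eval_postfix(tokens, xi, coefs) - yi) ** 2 for xi, yi in zip(X, y)]
    return sum(errors) / len(errors)


def fit_coefficients(tokens, X, y, lr=0.05, steps=2000, eps=1e-6):
    """Plain gradient descent on the MSE with respect to the COF values.

    Gradients are approximated by central finite differences, an assumption
    made to keep the sketch dependency-free.
    """
    n = tokens.count("COF")
    coefs = [1.0] * n
    for _ in range(steps):
        grad = []
        for i in range(n):
            up = coefs[:]
            up[i] += eps
            down = coefs[:]
            down[i] -= eps
            grad.append((mse(tokens, X, y, up) - mse(tokens, X, y, down)) / (2 * eps))
        coefs = [c - lr * g for c, g in zip(coefs, grad)]
    return coefs


# Toy data generated from y = 3*x0 + 2; the template encodes COF*x0 + COF.
template = ["COF", "x0", "*", "COF", "+"]
X = [[i / 10.0] for i in range(-10, 11)]
y = [3.0 * xi[0] + 2.0 for xi in X]
coefs = fit_coefficients(template, X, y)
```

On this toy problem the fitted values in coefs should approach 3 and 2. Replacing the gradient step with random perturbations that are kept only when they lower the MSE would give a hill-climbing baseline in the spirit of the comparison in the Figure 2 caption.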