Deep Generative Symbolic Regression

Samuel Holt, Zhaozhi Qian, Mihaela van der Schaar

TL;DR

The paper addresses the scalability and efficiency limitations of symbolic regression by introducing Deep Generative Symbolic Regression (DGSR), which learns equation invariances via a pre-trained conditional generative model and refines it at inference time for efficient MAP-like equation search. DGSR uses a set-transformer encoder and transformer decoder to model $p_\theta(f|\mathcal{D})$, trained with an end-to-end NMSE loss that respects equation equivalences, and employs neural-guided priority queue training (NGPQT) for inference-time refinement plus discrete search to select the best equation. Across diverse benchmarks (Feynman, SRBench, Nguyen, Livermore, and synthetic $d=12$), DGSR achieves higher recovery rates with more inputs, matches or surpasses state-of-the-art baselines, and reduces inference-time computation compared to RL-based symbolic regression methods. The work demonstrates DGSR’s ability to capture invariances, generalize to unseen variables, and provide a practical, scalable framework for discovering concise scientific equations. It also outlines limitations and avenues for future work, such as handling highly complex equations and improving constant optimization for broader real-world applicability.
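
To make the phrase "an end-to-end NMSE loss that respects equation equivalences" concrete, here is a minimal sketch and not the paper's implementation. It assumes NMSE means the mean squared error divided by the variance of the targets (a common definition; the exact normalization is not given in this summary). The ground-truth function, the candidate functions and the token sequences in the comments are all hypothetical. The point is that two differently ordered but equivalent equations receive the same NMSE, whereas a token-level loss would treat them as different targets.

```python
import numpy as np


def nmse(y_pred, y_true):
    """Mean squared error divided by the variance of the targets (assumed definition)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2) / np.var(y_true))


# Hypothetical dataset D = {(X_i, y_i)} with d = 2 input variables.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] + np.sin(X[:, 1])


def candidate_a(X):
    # Prefix tokens (illustrative): ["add", "x1", "sin", "x2"]
    return X[:, 0] + np.sin(X[:, 1])


def candidate_b(X):
    # Same function with the operands of "add" swapped:
    # ["add", "sin", "x2", "x1"]
    return np.sin(X[:, 1]) + X[:, 0]


def candidate_c(X):
    # A genuinely different equation: ["mul", "x1", "x2"]
    return X[:, 0] * X[:, 1]


nmse_a = nmse(candidate_a(X), y)
nmse_b = nmse(candidate_b(X), y)
nmse_c = nmse(candidate_c(X), y)

print(f"NMSE(a) = {nmse_a:.4f}, NMSE(b) = {nmse_b:.4f}, NMSE(c) = {nmse_c:.4f}")
```

Because the loss is computed on the outputs of the predicted equation rather than on its tokens, candidates a and b are indistinguishable to it, while candidate c is penalized.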

Abstract

Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the observation that closed-form equations often have structural characteristics and invariances (e.g., the commutative law) that could be further exploited to build more effective symbolic regression solutions. Motivated by this observation, our key contribution is to leverage pre-trained deep generative models to capture the intrinsic regularities of equations, thereby providing a solid foundation for subsequent optimization steps. We show that our novel formalism unifies several prominent approaches of symbolic regression and offers a new perspective to justify and improve on the previous ad hoc designs, such as the usage of cross-entropy loss during pre-training. Specifically, we propose an instantiation of our framework, Deep Generative Symbolic Regression (DGSR). In our experiments, we show that DGSR achieves a higher recovery rate of true equations in the setting of a larger number of input variables, and it is more computationally efficient at inference time than state-of-the-art RL symbolic regression solutions.

Paper Structure

This paper contains 35 sections, 7 equations, 13 figures, 37 tables, 2 algorithms.

Figures (13)

  • Figure 1: The data generating process.
  • Figure 2: Block diagram of DGSR. DGSR is able to learn the invariances of equations and datasets $\mathcal{D}$ (P1) by having both: (1) an encoding architecture that is permutation invariant across the $n$ samples in the observed dataset $\mathcal{D}=\{(\mathbf{X}_i,y_i)\}_{i=1}^n$, and (2) a Bayesian-inspired end-to-end NMSE loss function (Eq. \ref{main_loss}) from the encoded dataset $\mathcal{D}$ to the outputs of the predicted equations, i.e., $\text{NMSE}(\hat{f}(\mathbf{X}),\mathbf{y})$. The highlighted boundaries show the subset of pre-trained encoder-decoder methods and RL methods.
  • Figure 3: (a) Number of unique equations equivalent to the ground truth $f^*$ discovered for problem Feynman-7 (A. \ref{Feynmand2Results}), (b) Percentage of valid equations generated from a sample of $k$ for problem Feynman-7 (A. \ref{Feynmand2Results}), (c) Average recovery rate of the Feynman $d=2$, Feynman $d=5$ and Synthetic $d=12$ benchmark problem sets plotted against the number of input variables $d$.
  • Figure 4: (a-b) Pareto front of test NMSE against equation complexity. Labelled: (a) Feynman-8, (b) Feynman-13. The red line marks the complexity of the ground truth equation. Equations discovered are listed in A. \ref{Feynmand5ParetoFrontAnalysis}. (c) Negative log-likelihood of the ground truth equation $f^*$ for problem Feynman-7 (A. \ref{Feynmand2Results}).
  • Figure 5: Percentage of valid equations generated from a sample of $k$ equations on the Feynman-7 problem (A. \ref{Feynmand2Results}), using a different optimizer, that of \cite{petersen2020deep}.
  • ...and 8 more figures