Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion
Ryan T. Tymkow, Benjamin D. Schnapp, Mojtaba Valipour, Ali Ghodsi
TL;DR
This work reframes equation generation for symbolic regression as a discrete-state diffusion problem. It introduces Symbolic Diffusion, a D3PM-based model that generates all tokens of an equation in parallel using a PointNet-style encoder and a transformer decoder. To allow a fair comparison, the model is evaluated against an autoregressive SymbolicGPT baseline that shares the same architecture. On a bivariate dataset, Symbolic Diffusion performs competitively: it attains a statistically higher mean $R^2$, while autoregressive models may achieve higher token accuracy at tight tolerances. These results suggest that diffusion-based generation with global context is a viable alternative for neural symbolic regression. The study provides open-source code, highlights the potential of diffusion models to improve syntactic validity and inference speed, and suggests scaling to higher dimensions and end-to-end constant prediction as future directions.
Abstract
Symbolic regression is the task of finding a closed-form mathematical expression that fits a set of data points. Genetic-programming-based techniques are the most common approach to this problem, but neural-network-based approaches have recently gained popularity. Most leading neural-network-based models for symbolic regression use transformer-based autoregressive decoders to generate an equation conditioned on encoded input points. However, autoregressive generation produces tokens strictly left to right, so each token is conditioned only on the tokens generated before it. Motivated by the goal of generating all tokens simultaneously to produce better closed-form equations, we propose Symbolic Diffusion, a D3PM-based discrete state-space diffusion model that generates all tokens of the equation at once. Using the bivariate dataset developed for SymbolicGPT, we compare our diffusion-based approach to an autoregressive model based on SymbolicGPT, using equivalent encoder and transformer architectures. We show that diffusion-based generation for symbolic regression can offer comparable and, by some metrics, improved performance over autoregressive generation in models with similar underlying architectures, opening new research opportunities in neural-network-based symbolic regression.
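To make the idea of discrete token diffusion concrete, the forward (noising) half of a D3PM-style process can be sketched as below. This is an illustrative sketch and not the authors' implementation. It assumes a uniform transition kernel, which is only one of the kernels D3PM supports, and the function names, the noise schedule `betas`, and the vocabulary size are hypothetical. Under this assumption, every token position is corrupted independently, which is why a denoising model can later predict all positions in parallel instead of left to right.

```python
import random


def cumulative_keep_prob(betas, t):
    """Return prod_{s=1..t} (1 - beta_s), the probability that a token has
    not been resampled during the first t uniform-noise steps."""
    keep = 1.0
    for beta in betas[:t]:
        keep *= 1.0 - beta
    return keep


def corrupt_tokens(tokens, vocab_size, betas, t, rng):
    """Sample x_t ~ q(x_t | x_0) under an (assumed) uniform D3PM kernel.

    Each position is noised independently: with probability `keep` the
    original token is kept; otherwise it is redrawn uniformly from the
    whole vocabulary (possibly landing on the same token again).
    """
    keep = cumulative_keep_prob(betas, t)
    noisy = []
    for tok in tokens:
        if rng.random() < keep:
            noisy.append(tok)
        else:
            noisy.append(rng.randrange(vocab_size))
    return noisy


if __name__ == "__main__":
    # Hypothetical 7-symbol vocabulary and a short linear noise schedule.
    betas = [0.05 * (i + 1) for i in range(10)]
    equation_tokens = [0, 1, 2, 3, 4]
    rng = random.Random(0)
    for step in (0, 5, 10):
        print(step, corrupt_tokens(equation_tokens, 7, betas, step, rng))
```

At step 0 the sequence is returned unchanged, and as the step grows the tokens drift toward uniform noise. A reverse model would be trained to undo this corruption for every position at once, conditioned on the encoded data points.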
