Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion

Ryan T. Tymkow, Benjamin D. Schnapp, Mojtaba Valipour, Ali Ghodshi

TL;DR

This work reframes equation generation for symbolic regression as a discrete-state diffusion problem. It introduces Symbolic Diffusion, a D3PM-based model that generates all tokens of an equation in parallel using a PointNet-style encoder and a transformer decoder. To allow a fair comparison, the autoregressive SymbolicGPT baseline shares the same architecture. On a bivariate dataset, Symbolic Diffusion achieves competitive performance: its mean $R^2$ is statistically higher, while autoregressive models may have higher token accuracy at tight tolerances. These results indicate that diffusion-based generation with global context is a viable alternative for neural symbolic regression. The study provides open-source code, highlights the potential of diffusion models to improve syntactic validity and inference speed in symbolic regression, and suggests scaling to higher dimensions and end-to-end constant prediction as future directions.

Abstract

Symbolic regression is the task of finding a closed-form mathematical expression that fits a set of data points. Genetic-programming-based techniques are the most common algorithms for this problem, but neural-network-based approaches have recently gained popularity. Most leading neural-network-based models for symbolic regression use transformer-based autoregressive models to generate an equation conditioned on encoded input points. However, autoregressive generation is limited to producing tokens left to right, so each new token is conditioned only on previously generated tokens. Motivated by the desire to generate all tokens simultaneously and thereby produce improved closed-form equations, we propose Symbolic Diffusion, a D3PM-based discrete state-space diffusion model that generates all tokens of the equation simultaneously using discrete token diffusion. Using the bivariate dataset developed for SymbolicGPT, we compared our diffusion-based generation approach to an autoregressive model based on SymbolicGPT, using equivalent encoder and transformer architectures. We demonstrate that our novel approach of using diffusion-based generation for symbolic regression can offer comparable and, by some metrics, improved performance over autoregressive generation in models using similar underlying architectures, opening new research opportunities in neural-network-based symbolic regression.
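
The forward (noising) half of discrete token diffusion can be illustrated with a small sketch. It assumes the uniform transition matrix from the D3PM formulation, $Q_t = (1-\beta_t) I + (\beta_t/K)\,\mathbf{1}\mathbf{1}^\top$. The abstract does not state which transition matrix Symbolic Diffusion uses, and the vocabulary, noise schedule, and function names below are hypothetical. The learned reverse process, in which the transformer denoises every position at once, is not shown.

```python
import numpy as np

# Hypothetical toy vocabulary; the paper's actual tokenizer is not shown here.
VOCAB = ["x1", "x2", "+", "*", "sin", "(", ")", "C"]
K = len(VOCAB)

def uniform_transition_matrix(beta: float, vocab_size: int) -> np.ndarray:
    """Row-stochastic Q = (1 - beta) * I + (beta / K) * ones((K, K))."""
    identity = np.eye(vocab_size)
    uniform = np.full((vocab_size, vocab_size), 1.0 / vocab_size)
    return (1.0 - beta) * identity + beta * uniform

def forward_noise_step(tokens, q, rng):
    """Sample x_t from Cat(q[x_{t-1}]) independently at every position."""
    return np.array([rng.choice(q.shape[0], p=q[t]) for t in tokens])

rng = np.random.default_rng(0)

# Token ids for the expression "sin ( x1 ) + C * x2".
x0 = np.array([VOCAB.index(tok)
               for tok in ["sin", "(", "x1", ")", "+", "C", "*", "x2"]])

betas = np.linspace(0.02, 0.5, 20)  # assumed noise schedule, not from the paper
xt = x0.copy()
for beta in betas:
    xt = forward_noise_step(xt, uniform_transition_matrix(beta, K), rng)

print("clean:", [VOCAB[i] for i in x0])
print("noisy:", [VOCAB[i] for i in xt])
```

Each position is corrupted independently of its neighbours, which is what lets a denoising model update all positions of the equation in the same step instead of committing to tokens left to right.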

Paper Structure

This paper contains 22 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Symbolic Diffusion Architecture
  • Figure 2: Architecture of the T-Net Encoder. $B$—batch size, $N$—number of points, $C$—input feature dimensionality, and $E$—embedding dimension.
  • Figure 3: Common transformer architecture used by both Symbolic Diffusion and SymbolicGPT. $B$—batch size, $N$—number of points, $C$—input feature dimensionality, $E$—embedding dimension, and $S$—number of tokens.
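
The shape symbols in the captions above can be read through a minimal sketch of a PointNet-style encoder: a shared per-point layer followed by max-pooling over the $N$ points. This is an assumption about the general pattern only. The T-Net in Figure 2 may contain additional layers, and the sizes, the hidden width `H`, the random weights, and the function name are all hypothetical.

```python
import numpy as np

def pointnet_style_encode(points, w1, w2):
    """Map a point set of shape (B, N, C) to one embedding per set, shape (B, E)."""
    h = np.maximum(points @ w1, 0.0)  # shared per-point layer + ReLU: (B, N, H)
    h = h @ w2                        # per-point projection: (B, N, E)
    return h.max(axis=1)              # max-pool over the N points: (B, E)

rng = np.random.default_rng(0)
B, N, C, H, E = 4, 200, 3, 32, 16     # hypothetical sizes
points = rng.normal(size=(B, N, C))
w1 = rng.normal(size=(C, H))
w2 = rng.normal(size=(H, E))

embedding = pointnet_style_encode(points, w1, w2)
print(embedding.shape)  # (4, 16)

# Reordering the points leaves the embedding unchanged up to rounding.
shuffled = points[:, rng.permutation(N), :]
print(np.allclose(embedding, pointnet_style_encode(shuffled, w1, w2)))
```

Max-pooling makes the embedding independent of the order in which the data points are listed, which suits symbolic regression because the input is an unordered set of samples.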