SymbolicGPT: A Generative Transformer Model for Symbolic Regression

Mojtaba Valipour, Bowen You, Maysum Panju, Ali Ghodsi

TL;DR

Symbolic regression aims to discover closed-form expressions fitting data, but traditional approaches struggle with search space size and per-instance training costs. The paper introduces SymbolicGPT, a GPT-based framework that first computes an order-invariant embedding of input data via a T-net, then generates a skeleton of a symbolic equation, and finally fills in constants with a BFGS optimization step. Empirical results show faster inference and competitive accuracy across multi-variable settings compared to Deep Symbolic Regression, Genetic Programming, and MLP baselines, with robustness to varying data sizes. The approach represents a scalable, one-time-training paradigm that leverages advances in GPT technology for symbolic reasoning tasks.
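
The final stage of the pipeline described above, filling in constants for a generated skeleton, can be illustrated with a minimal sketch. The skeleton `C0 * x**2 + C1 * sin(x) + C2`, the helper names `skeleton` and `fit_constants`, the synthetic data, and the use of SciPy's BFGS implementation are all assumptions made for illustration; the summary states only that a BFGS optimization step is used, not how it is implemented.

```python
import numpy as np
from scipy.optimize import minimize


def skeleton(x, c):
    # Hypothetical skeleton "C0 * x**2 + C1 * sin(x) + C2" with the
    # constant placeholders supplied by the vector c (illustrative only).
    return c[0] * x**2 + c[1] * np.sin(x) + c[2]


def fit_constants(x, y, n_constants):
    # Minimise the mean squared error over the constants with BFGS.
    # Starting from all ones is an arbitrary choice for this sketch.
    def loss(c):
        return np.mean((skeleton(x, c) - y) ** 2)

    result = minimize(loss, np.ones(n_constants), method="BFGS")
    return result.x, result.fun


# Synthetic data generated from known constants (2.0, -1.0, 0.5).
x = np.linspace(-3.0, 3.0, 200)
y = 2.0 * x**2 - 1.0 * np.sin(x) + 0.5

constants, mse = fit_constants(x, y, n_constants=3)
```

This skeleton is linear in its constants, so the loss is convex and the starting point matters little. For skeletons where constants appear inside nonlinear functions, the result of a local optimizer such as BFGS can depend on the initial guess.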

Abstract

Symbolic regression is the task of identifying a mathematical expression that best fits a provided dataset of input and output values. Due to the richness of the space of mathematical expressions, symbolic regression is generally a challenging problem. While conventional approaches based on genetic evolution algorithms have been used for decades, deep learning-based methods are relatively new and an active research area. In this work, we present SymbolicGPT, a novel transformer-based language model for symbolic regression. This model exploits the advantages of probabilistic language models like GPT, including strength in performance and flexibility. Through comprehensive experiments, we show that our model performs strongly compared to competing models with respect to accuracy, running time, and data efficiency.

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The architecture of SymbolicGPT. The left box illustrates the structure of our order-invariant T-net for obtaining a vector representation of the input dataset, and the right box shows the structure of the GPT language model for producing symbolic equation skeletons.
  • Figure 2: Cumulative $\log MSE_N$ over all methods and experiments. Each curve shows the proportion of test cases that attained an error score below each given threshold. SymbolicGPT finds better-fitting equations for more test cases than DSR and finds more highly accurate equations (with $\log MSE_N < -10$) than any other method tested.
  • Figure 3: The effect of the number of points on the performance of the model.
  • Figure 4: Graphical representations of selected equations of one input variable. The solid blue curves are the graphs of the true underlying equations; the orange dotted curves are the predicted functions as generated by SymbolicGPT.
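
The cumulative curves described in the Figure 2 caption report, for each threshold, the proportion of test cases whose error score falls below it. A minimal sketch of that computation follows. The function name `cumulative_proportion` and the score values are made up for illustration, and the definition of $MSE_N$ is not given in this summary, so the scores are taken as already computed.

```python
import numpy as np


def cumulative_proportion(log_errors, thresholds):
    # For each threshold, return the fraction of test cases whose
    # log-error score is strictly below that threshold.
    log_errors = np.asarray(log_errors, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    below = log_errors[None, :] < thresholds[:, None]
    return below.mean(axis=1)


# Made-up log-error scores for five hypothetical test cases.
scores = [-12.0, -11.0, -4.0, -1.0, 2.0]
thresholds = [-10.0, 0.0, 5.0]

curve = cumulative_proportion(scores, thresholds)
```

With these illustrative scores, two of the five cases fall below the $-10$ threshold that the caption uses to mark highly accurate equations.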