Interpretable Scientific Discovery with Symbolic Regression: A Review

Nour Makke, Sanjay Chawla

TL;DR

This survey reviews Symbolic Regression (SR) with a focus on interpretability, contrasting it with traditional black-box models and highlighting the discrete, library-driven search space that SR navigates to uncover governing equations. It compares linear SR, nonlinear neural-symbolic SR (e.g., Equation Learner), and tree-based approaches (GP, transformers, RL), detailing how each represents expressions and handles search and optimization. The review covers diverse applications and benchmarks (e.g., Feynman, Nguyen, SRBench), discusses current limitations such as library dependence and scaling, and argues that integrative approaches combining domain knowledge with powerful search architectures hold the most promise for data-driven scientific discovery. Overall, SR balances interpretability and predictive accuracy: tree-based and physics-/mathematics-inspired methods offer the strongest performance, and real-data applications signal significant future impact in the physical sciences and beyond.

Abstract

Symbolic regression is emerging as a promising machine learning method for learning succinct underlying interpretable mathematical expressions directly from data. Whereas it has been traditionally tackled with genetic programming, it has recently gained a growing interest in deep learning as a data-driven model discovery method, achieving significant advances in various application domains ranging from fundamental to applied sciences. This survey presents a structured and comprehensive overview of symbolic regression methods and discusses their strengths and limitations.

Paper Structure (20 sections, 32 equations, 18 figures, 11 tables)

This paper contains 20 sections, 32 equations, 18 figures, and 11 tables.

Figures (18)

  • Figure 1: (a) Example of a unary-binary tree that encodes $f(\mathrm{x}) = x_1x_2 - 2x_3$. (b) Sequence representation of the tree-like structure of $f(\mathrm{x})$.
  • Figure 2: Taxonomy based on the type of symbolic regression methods. $\phi$ denotes a neural network function, $W$ denotes the set of learnable parameters in NN. $\mathbf{x}$ denotes the input data, $\mathbf{z}$ denotes a reduced representation of $\mathbf{x}$, and $\mathbf{x}^{\prime}$ denotes a new representation of $\mathbf{x}$, e.g., by defining new features based on the original ones. $\mathcal{T}$ represents the final population of selected expression trees in genetic programming.
  • Figure 3: Schematic of the system of linear equations of Eq. \ref{eq:yutheta} for $f(x) = 1 + \alpha x^3$. A library matrix $\mathrm{U}(\mathrm{X})$ of nonlinear functions of the input is constructed, where $L = \{1,x,x^2,x^3, \cdots\}$. The marked entries in the $\mathrm{\Theta}$ vector denote the non-zero coefficients determining which functions of the library are active.
  • Figure 4: Result of linear SR for the Nguyen-1 benchmark, i.e., $f(x) = x+x^2+x^3$. Red points represent the (test) data set. The red curve represents the true function. The blue and black dashed curves represent the functions learned using $L_1$ and $L_2$ regularization, respectively.
  • Figure 5: Two-dimensional multivariate normal distribution used in test applications.
  • ...and 13 more figures
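
The linear SR setup described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It builds a polynomial library matrix $\mathrm{U}(\mathrm{X})$ for the Nguyen-1 target $f(x) = x+x^2+x^3$ and fits the coefficient vector $\mathrm{\Theta}$ with $L_2$ and $L_1$ regularization. Everything beyond the target function and the library idea is an assumption made for illustration: the function names (`build_library`, `fit_l2`, `fit_l1`), the degree-5 library, the 50 noise-free sample points, the regularization strengths, and the use of ISTA (a simple proximal-gradient solver) for the $L_1$ fit. It is not the procedure used in the paper.

```python
import numpy as np

def build_library(x, max_degree=5):
    """Library matrix U(X): columns are 1, x, x^2, ..., x^max_degree."""
    return np.vander(x, N=max_degree + 1, increasing=True)

def fit_l2(U, y, lam):
    """Ridge (L2-regularized) least squares, solved in closed form."""
    n_terms = U.shape[1]
    return np.linalg.solve(U.T @ U + lam * np.eye(n_terms), U.T @ y)

def fit_l1(U, y, alpha, n_iter=20000):
    """Lasso (L1-regularized) least squares via ISTA.

    Minimizes (1 / 2n) * ||y - U theta||^2 + alpha * ||theta||_1.
    ISTA is an illustrative choice of solver, not taken from the paper.
    """
    n = U.shape[0]
    gram = U.T @ U / n
    rhs = U.T @ y / n
    # Step size 1 / L, where L is the largest eigenvalue of the Gram matrix.
    step = 1.0 / np.linalg.eigvalsh(gram)[-1]
    theta = np.zeros(U.shape[1])
    for _ in range(n_iter):
        z = theta - step * (gram @ theta - rhs)  # gradient step
        # Soft-thresholding: the proximal operator of the L1 penalty.
        theta = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return theta

# Nguyen-1 target sampled without noise (assumption for this sketch).
x = np.linspace(-1.0, 1.0, 50)
y = x + x**2 + x**3

U = build_library(x)
theta_l2 = fit_l2(U, y, lam=1e-6)    # hypothetical regularization strength
theta_l1 = fit_l1(U, y, alpha=1e-4)  # hypothetical regularization strength

# Report the library terms whose coefficients are not negligible.
terms = ["1", "x", "x^2", "x^3", "x^4", "x^5"]
for name, theta in [("L2", theta_l2), ("L1", theta_l1)]:
    active = {t: round(float(c), 3) for t, c in zip(terms, theta) if abs(c) > 1e-3}
    print(name, active)
```

The non-negligible entries of each fitted vector play the role of the marked entries of $\mathrm{\Theta}$ in Figure 3: they select which library functions are active in the recovered expression. On noisy data, the regularization strengths would need tuning, and the two penalties generally select different terms.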