SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond

Madhav R. Muthyala, Farshud Sorourifar, You Peng, Joel A. Paulson

TL;DR

SyMANTIC tackles the challenge of discovering interpretable, parsimonious symbolic expressions from high-dimensional data by combining mutual-information feature screening, a complexity-aware library of expanded features, and a complexity-constrained SISSO (C$^2$-SISSO) approach to efficiently search near the Pareto frontier of accuracy and simplicity. The method, implemented in PyTorch with GPU acceleration, automatically tunes hyperparameters and produces a set of approximate Pareto-optimal models rather than a single solution. Across synthetic benchmarks, dynamical-system problems, and material-property problems, SyMANTIC achieves higher recovery rates, lower structural complexity, and faster runtimes than state-of-the-art SR methods, even under noise and limited data. This yields practical, interpretable models suitable for scientific discovery and real-world applications, and the open-source package makes the method easy to adopt and extend.

Abstract

Symbolic regression (SR) is an emerging branch of machine learning focused on discovering simple and interpretable mathematical expressions from data. Although a wide variety of SR methods have been developed, they often face challenges such as high computational cost, poor scalability with respect to the number of input dimensions, fragility to noise, and an inability to balance accuracy and complexity. This work introduces SyMANTIC, a novel SR algorithm that addresses these challenges. SyMANTIC efficiently identifies (potentially several) low-dimensional descriptors from a large set of candidates (from $\sim 10^5$ to $\sim 10^{10}$ or more) through a unique combination of mutual information-based feature selection, adaptive feature expansion, and recursively applied $\ell_0$-based sparse regression. In addition, it employs an information-theoretic measure to produce an approximate set of Pareto-optimal equations, each offering the best-found accuracy for a given complexity. Furthermore, our open-source implementation of SyMANTIC, built on the PyTorch ecosystem, facilitates easy installation and GPU acceleration. We demonstrate the effectiveness of SyMANTIC across a range of problems, including synthetic examples, scientific benchmarks, real-world material property predictions, and chaotic dynamical system identification from small datasets. Extensive comparisons show that SyMANTIC uncovers similar or more accurate models at a fraction of the cost of existing SR methods.
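
The idea of keeping, for each complexity level, only the best-found accuracy can be illustrated with a small non-dominated filter. The following is a minimal sketch, not the package's implementation: the function name `pareto_front`, the `(expression, complexity, loss)` tuple layout, and the candidate values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (expression, complexity, loss).
Model = Tuple[str, float, float]


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models that are not dominated in (complexity, loss)."""
    # Sort by complexity, breaking ties by loss, so a single pass suffices.
    ordered = sorted(models, key=lambda m: (m[1], m[2]))
    front: List[Model] = []
    best_loss = float("inf")
    for expr, complexity, loss in ordered:
        # Keep a model only if it beats every model that is at most as complex.
        if loss < best_loss:
            front.append((expr, complexity, loss))
            best_loss = loss
    return front


# Made-up candidates, purely illustrative.
candidates = [
    ("x1", 1, 0.90),
    ("x1*x2", 3, 0.40),
    ("x1 + x2", 3, 0.55),
    ("sin(x1)*x2", 5, 0.45),
    ("x1*x2/x3", 5, 0.05),
]
front = pareto_front(candidates)
for expr, complexity, loss in front:
    print(f"complexity={complexity}  loss={loss:.2f}  {expr}")
```

Each surviving entry is the lowest-loss candidate at its complexity and also improves on all simpler candidates, which is the trade-off curve a user would inspect when choosing a final model.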

Paper Structure

This paper contains 44 sections, 15 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Schematic illustration of our proposed SyMANTIC algorithm. It takes in training data in the form of several $(\boldsymbol{x}_i, y_i)$ pairs and a user-defined operator set, with the default choice shown in \ref{eq:operator-set-default}. The training data is fed to a mutual information pre-screening method (Step 1) to limit the number of primary features sent to feature expansion (Step 2). The feature expansion step is carried out recursively over some number of levels $l$. The expanded features, along with a complexity filter parameter $\lambda_c$, are input to our C$^2$-SISSO algorithm (Step 3) whose details are summarized in Figure \ref{fig:c2sisso-flowchart}. C$^2$-SISSO outputs a new set of tested models $\mathcal{M}$ with corresponding loss $\mathcal{L}$ and complexity $\mathcal{C}$ values. Using the new $\mathcal{L}$ and $\mathcal{C}$, the current approximation to the Pareto front is updated. We run an automated procedure over several $l$ and $c$ values (Step 4) to recursively improve the Pareto front. Note that the algorithm will stop and return the current Pareto front when either a threshold on the root mean squared error is satisfied or all hyperparameter combinations are tried (Exit Condition).
  • Figure 2: Schematic illustration of C$^2$-SISSO, which takes as inputs features, measured outputs, and a complexity cutoff parameter and returns symbolic expression, loss, and complexity values for a set of tested models. The models are trained using $\ell_0$ regression to identify the best $t$-term model over a subset of features sequentially identified using sure independence screening (SIS) applied to the residual of the previous model. The process is repeated until a maximum number of terms $T$, which is set by the user, is reached.
  • Figure 3: Results on the full set of test equations in Table \ref{tab:benchmark-equations} for all algorithms. Left shows the percent of ground-truth equations recovered, middle shows the median model complexity across all test problems, and right shows the median training time across all test problems. Points indicate the mean of the metrics across the 5 replicates while bars show the estimated 95% confidence intervals. The dotted line in the training time plot represents the 5-minute time limit imposed per problem on all methods.
  • Figure 4: Approximate Pareto fronts between root mean squared error (RMSE) and structural complexity found by SyMANTIC and PySR on the relativistic momentum problem.
  • Figure 5: Normalized root mean squared error (NRMSE) versus the percentage of noise added to the measured target values (referenced as parameter $v$ in the text) for the Rydberg formula for 10 independent replicates for SyMANTIC and PySR.
  • ...and 2 more figures
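
The loop described in the Figure 2 caption (complexity cutoff, SIS on the residual of the previous model, then $\ell_0$ regression for the best $t$-term model) can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' PyTorch implementation: the function names `sis`, `least_squares`, and `sis_l0` are hypothetical, absolute Pearson correlation is assumed as the screening score, the $\ell_0$ step is an exhaustive search over $t$-subsets, and the toy features and complexity values are made up.

```python
import itertools
import numpy as np


def sis(X, residual, k, exclude):
    """Return the k not-yet-selected columns most correlated with the residual."""
    Xc = X - X.mean(axis=0)
    rc = residual - residual.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(rc) + 1e-12
    scores = np.abs(Xc.T @ rc) / denom
    ranked = [int(j) for j in np.argsort(-scores) if int(j) not in exclude]
    return ranked[:k]


def least_squares(X, y, cols):
    """Fit y on the chosen columns plus an intercept; return (rmse, prediction)."""
    A = np.column_stack([X[:, list(cols)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    return float(np.sqrt(np.mean((pred - y) ** 2))), pred


def sis_l0(features, y, complexity, max_complexity, max_terms=2, k=3):
    """Return [(original column indices, rmse)] for the best 1..max_terms models."""
    # Complexity filter: discard features above the cutoff before any screening.
    allowed = np.flatnonzero(np.asarray(complexity) <= max_complexity)
    X = features[:, allowed]
    subset, residual, models = [], y.copy(), []
    for t in range(1, max_terms + 1):
        # SIS on the current residual grows the candidate subset.
        subset += sis(X, residual, k, set(subset))
        best_cols, best_rmse, best_pred = None, np.inf, None
        # Exhaustive l0 search: best t-term model over the screened subset.
        for cols in itertools.combinations(subset, t):
            rmse, pred = least_squares(X, y, cols)
            if rmse < best_rmse:
                best_cols, best_rmse, best_pred = cols, rmse, pred
        if best_cols is None:  # fewer than t screened features available
            break
        residual = y - best_pred
        models.append((tuple(int(allowed[c]) for c in best_cols), best_rmse))
    return models


# Toy data with made-up complexity values, purely illustrative.
rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(1.0, 2.0, size=(3, 50))
names = ["x1", "x2", "x3", "x1*x2", "x1/x3", "x2**2", "exp(x1)*x2"]
features = np.column_stack([x1, x2, x3, x1 * x2, x1 / x3, x2**2, np.exp(x1) * x2])
complexity = [1, 1, 1, 3, 3, 3, 5]
y = 2.0 * x1 * x2 + 0.5 * x3 + 1.0

models = sis_l0(features, y, complexity, max_complexity=3)
for cols, rmse in models:
    print([names[c] for c in cols], f"rmse={rmse:.2e}")
```

In this toy setting the feature with complexity 5 is removed by the cutoff before screening, so it can never appear in a returned model. Each returned model carries its own loss, which is the information needed to update an accuracy-versus-complexity front.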

Theorems & Definitions (2)

  • Remark 1
  • Remark 2