Symbolic Regression as Feature Engineering Method for Machine and Deep Learning Regression Tasks

Assaf Shmuel, Oren Glickman, Teddy Lazebnik

TL;DR

This work addresses feature engineering (FE) for regression by running symbolic regression (SR) as an upfront FE step that generates informative features $X^*$ before ML/DL models are trained. Using GPlearn for SR and the AutoML tools TPOT and AutoKeras, the authors report substantial RMSE improvements on synthetic and real-world datasets, including a gain of more than 20% on a physics-focused superconducting critical temperature (Tc) case. The findings indicate that SR-based FE can augment both traditional ML and DL pipelines, often outperforming the baselines and remaining robust to varying sample sizes and noise, especially in nonlinear settings. The study presents SR-based FE as a practical, interpretable enhancement that reduces reliance on manual feature crafting and may extend to broader domains.

Abstract

In machine and deep learning regression tasks, effective feature engineering (FE) is pivotal to model performance. Traditional approaches to FE often rely on domain expertise to manually design features for machine learning models. In deep learning models, FE is embedded in the neural network's architecture, which makes it hard to interpret. In this study, we propose to integrate symbolic regression (SR) as an FE process before a machine learning model to improve its performance. We show, through extensive experimentation on synthetic and real-world physics-related datasets, that incorporating SR-derived features significantly enhances the predictive capabilities of both machine and deep learning regression models, with a 34-86% root mean square error (RMSE) improvement on synthetic datasets and a 4-11.5% improvement on real-world datasets. In addition, as a realistic use case, we show that the proposed method improves machine learning performance in predicting superconducting critical temperatures based on Eliashberg theory by more than 20% in terms of RMSE. These results outline the potential of SR as an FE component in data-driven models.
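
The workflow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the SR step is replaced by a toy search over four fixed candidate expressions (standing in for GPlearn), and the downstream model is ordinary least squares (standing in for TPOT or AutoKeras). The synthetic data, the names `CANDIDATES`, `fit_toy_sr`, and `fit_linear`, the choice to append the SR output as one extra column, and the relative-improvement formula are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): the target is a noisy product of two inputs.
n = 400
X = rng.uniform(1.0, 3.0, size=(n, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=n)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Toy stand-in for symbolic regression: a fixed menu of candidate expressions.
# A real SR tool would search a much larger expression space.
CANDIDATES = {
    "x0 + x1": lambda X: X[:, 0] + X[:, 1],
    "x0 - x1": lambda X: X[:, 0] - X[:, 1],
    "x0 * x1": lambda X: X[:, 0] * X[:, 1],
    "x0 / x1": lambda X: X[:, 0] / X[:, 1],
}


def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_toy_sr(X, y):
    """Return the candidate expression with the lowest training RMSE."""
    best = min(CANDIDATES, key=lambda name: rmse(y, CANDIDATES[name](X)))
    return best, CANDIDATES[best]


def fit_linear(X, y):
    """Least-squares linear model with intercept (stand-in for AutoML)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


# 1) Fit the SR stand-in on the training split only.
sr_name, sr_expr = fit_toy_sr(X_train, y_train)

# 2) Append the SR output as one extra feature column (assumed construction).
X_train_sr = np.column_stack([X_train, sr_expr(X_train)])
X_test_sr = np.column_stack([X_test, sr_expr(X_test)])

# 3) Train the same downstream model with and without the SR feature.
baseline = fit_linear(X_train, y_train)
enhanced = fit_linear(X_train_sr, y_train)

# 4) Compare test RMSE (relative improvement formula is an assumption).
rmse_base = rmse(y_test, baseline(X_test))
rmse_sr = rmse(y_test, enhanced(X_test_sr))
improvement = 100.0 * (rmse_base - rmse_sr) / rmse_base
print(f"SR feature: {sr_name}")
print(f"baseline RMSE={rmse_base:.3f}, SR-enhanced RMSE={rmse_sr:.3f}")
print(f"relative improvement: {improvement:.1f}%")
```

The SR stand-in is fitted on the training split only, so the engineered feature carries no information from the test targets. Swapping in a real SR library and an AutoML regressor would keep the same four steps.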

Paper Structure (14 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: A schematic view of the method and experiment. We begin by either generating synthetic datasets or using real-world datasets. We then train the SR model and create the SR-based feature. Next, we train AutoML models on the SR-enhanced data (the case sample) and on the original data (the control). We compare the RMSE scores of both models on the test data.
  • Figure 2: Summary of model performances. (a) and (b) illustrate the relative improvement of the SRTPOT and SRAK models compared to the TPOT and AK models, respectively, in the synthetic datasets. (c) and (d) illustrate the relative improvement of the SRTPOT and SRAK models compared to the TPOT and AK models, respectively, in the real datasets. Values above 100% are not shown; these account for approximately 25% of the observations in subfigures (a) and (b), 1% of the observations in subfigure (c), and 3% of the observations in subfigure (d).
  • Figure 3: Summary of model performances predicting superconducting critical temperature. (a) and (b) illustrate the relative improvement of the SRTPOT and SRAK models compared to the TPOT and AK models, respectively.
  • Figure 4: Robustness tests for synthetic data noise and sample size. (a) illustrates the relative improvement of the SRTPOT model compared to the TPOT model. (b) illustrates the relative improvement of the SRAK model compared to the AK model.
  • Figure 5: Robustness tests for synthetic data non-linearity. In subfigures (a) and (b), non-linearity is defined as the RMSE of a linear regression (LR) model divided by the standard deviation of the target variable in each dataset. (a) illustrates the relative improvement of the SRTPOT model compared to the TPOT model. (b) illustrates the relative improvement of the SRAK model compared to the AK model. In subfigures (c) and (d), non-linearity is defined as $1-R^2$ of the LR model in each dataset. (c) illustrates the relative improvement of the SRTPOT model compared to the TPOT model. (d) illustrates the relative improvement of the SRAK model compared to the AK model.
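
The two non-linearity measures named in the Figure 5 caption can be sketched as follows, reading LR as an ordinary least-squares linear regression. The helper name `nonlinearity_scores` and the example data are hypothetical, and the scores are computed in-sample here because the caption does not say which data split is used.

```python
import numpy as np


def nonlinearity_scores(X, y):
    """Return (LR RMSE / std(y), 1 - R^2) for a least-squares linear fit.

    Both scores are computed on the same data used for the fit
    (an assumption; the caption does not specify the split).
    """
    A = np.column_stack([np.ones(len(X)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # fit the linear model
    residuals = y - A @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    rmse_over_std = float(np.sqrt(ss_res / len(y)) / np.std(y))
    one_minus_r2 = ss_res / ss_tot
    return rmse_over_std, one_minus_r2


rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))

# A purely linear target: both scores should be close to zero.
y_linear = 3.0 * X[:, 0] - X[:, 1]
lin_scores = nonlinearity_scores(X, y_linear)

# A strongly non-linear target: both scores should be close to one.
y_nonlinear = np.sin(3.0 * X[:, 0]) * X[:, 1] ** 2
nonlin_scores = nonlinearity_scores(X, y_nonlinear)

print("linear target:     RMSE/std=%.3f, 1-R^2=%.3f" % lin_scores)
print("non-linear target: RMSE/std=%.3f, 1-R^2=%.3f" % nonlin_scores)
```

When both scores are computed in-sample with an intercept and the population standard deviation, the first is the square root of the second, so the two definitions rank datasets identically in that setting and differ only in scale.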