Scalable Neural Symbolic Regression using Control Variables

Xieting Chu, Hongjue Zhao, Enze Xu, Hairong Qi, Minghan Chen, Huajie Shao

TL;DR

ScaleSR targets the scalability gap in symbolic regression (SR) for multi-variable expressions by decomposing the problem into a sequence of single-variable SR tasks, guided by a DNN-learned data generator and control-variable sampling. It learns a data model $f(\boldsymbol{x})$, generates variable-specific data via control variables, applies single-variable SR to estimate each variable's expression, and iteratively incorporates additional variables to assemble the final multivariate equation. The approach is reported to substantially reduce the search space, to achieve higher recovery rates on the Nguyen and Jin SR benchmarks, and to recover accurate governing equations for gene regulatory networks, outperforming baselines in both accuracy and efficiency. The work aims at scalable, interpretable modeling for scientific discovery; the authors state that the source code will be released upon publication.

Abstract

Symbolic regression (SR) is a powerful technique for discovering analytical mathematical expressions from data, and it has found various applications in the natural sciences because its results are interpretable. However, existing methods face scalability issues when dealing with complex equations involving multiple variables. To address this challenge, we propose ScaleSR, a scalable symbolic regression model that leverages control variables to enhance both accuracy and scalability. The core idea is to decompose multi-variable symbolic regression into a set of single-variable SR problems, which are then combined in a bottom-up manner. The proposed method involves a four-step process. First, we learn a data generator from observed data using deep neural networks (DNNs). Second, the data generator is used to generate samples for a certain variable by controlling the input variables. Third, single-variable symbolic regression is applied to estimate the corresponding mathematical expression. Finally, we repeat steps 2 and 3, adding variables one by one until all variables are covered. We evaluate the performance of our method on multiple benchmark datasets. Experimental results demonstrate that the proposed ScaleSR significantly outperforms state-of-the-art baselines in discovering mathematical expressions with multiple variables. Moreover, it can substantially reduce the search space for symbolic regression. The source code will be made publicly available upon publication.
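
The second step of the abstract (generating samples for one variable while controlling the others) can be illustrated with a small sketch. This is not the authors' implementation: the `generator` function below is a hypothetical closed-form stand-in for the trained DNN data generator, and the helper name `sample_with_controls` and its parameters are assumptions introduced only for this example.

```python
import numpy as np

def generator(X):
    # Hypothetical stand-in for the trained DNN data generator.
    # A known function y = x0^2 + 3*x1 is used only so the sketch runs.
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

def sample_with_controls(gen, vary_idx, fixed_point, grid):
    """Query `gen` while only variable `vary_idx` changes.

    All other inputs are held at the values in `fixed_point`;
    these held inputs play the role of the control variables.
    """
    X = np.tile(np.asarray(fixed_point, dtype=float), (len(grid), 1))
    X[:, vary_idx] = grid  # only the chosen column varies
    return X, gen(X)

grid = np.linspace(-1.0, 1.0, 5)
X, y = sample_with_controls(generator, vary_idx=0,
                            fixed_point=[0.0, 2.0], grid=grid)
```

With `x1` held at 2.0, the returned `y` depends on `x0` alone, so a single-variable SR method would only need to search over expressions in one variable for this slice of the data.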

Paper Structure (25 sections, 6 equations, 5 figures, 7 tables, 2 algorithms)

Figures (5)

  • Figure 1: The overall framework of ScaleSR, consisting of three main components: i) learn a data generator using DNNs; ii) generate data for each independent variable via control variables; iii) apply single-variable SR to estimate the mathematical equation for the current independent variable.
  • Figure 2: The framework of data generation with control variables. Specifically, we assign a random value to the newly added variable $x_{i+1}$, then vary the previously learned variables while holding the other control variables fixed; this yields one group of data points. By choosing $K$ different values for the current variable $x_{i+1}$, we can generate $K$ groups of data samples.
  • Figure 3: The framework of single-variable SR. It is performed in two steps: 1) we first use an optimization method, such as BFGS, to estimate $K$ groups of coefficients for the current independent variable; 2) we then use a single-variable SR model to estimate coefficients in the mathematical equation related to the current variable.
  • Figure 4: The relationship between complexity and search space for different methods, based on $1000$ equations of varying complexity.
  • Figure 5: Trajectory prediction of the genetic toggle switch and the repressilator using ScaleSR. The predicted trajectories closely match the ground truth.
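
The first step described in the Figure 3 caption (using an optimizer such as BFGS to estimate $K$ groups of coefficients) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `generator` is a hypothetical stand-in for the learned data generator, `skeleton` is an assumed expression form for the previously learned variable `x1`, and the grid and the $K = 5$ values of `x2` are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def generator(x1, x2):
    # Hypothetical stand-in for the learned data generator.
    return (1.0 + x2) * x1 ** 2 + 2.0 * x2

def skeleton(c, x1):
    # Assumed expression form found for x1 in an earlier round.
    return c[0] * x1 ** 2 + c[1]

def loss(c, x1, y):
    # Mean squared error between the skeleton and the generated data.
    return np.mean((skeleton(c, x1) - y) ** 2)

x1_grid = np.linspace(-2.0, 2.0, 50)
x2_values = np.array([-1.0, -0.5, 0.5, 1.0, 2.0])  # K = 5 control values

coeffs = []
for x2 in x2_values:
    y = generator(x1_grid, x2)  # one group of samples with x2 held fixed
    result = minimize(loss, x0=np.zeros(2), args=(x1_grid, y), method="BFGS")
    coeffs.append(result.x)
coeffs = np.array(coeffs)  # shape (K, 2): one coefficient pair per x2 value
```

Each column of `coeffs`, paired with `x2_values`, forms a one-dimensional dataset. In this toy setup the first column follows `1 + x2` and the second follows `2 * x2`, which is the kind of relationship the second step of Figure 3 would hand to a single-variable SR model.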

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3