Scalable Neural Symbolic Regression using Control Variables
Xieting Chu, Hongjue Zhao, Enze Xu, Hairong Qi, Minghan Chen, Huajie Shao
TL;DR
ScaleSR targets the scalability gap in symbolic regression (SR) for multi-variable expressions by decomposing the problem into a sequence of single-variable SR tasks, guided by a DNN-learned data generator and control-variable sampling. It learns a data model $f(\boldsymbol{x})$, generates variable-specific data via control variables, applies single-variable SR to estimate each variable's expression, and iteratively incorporates additional variables to assemble the final multivariate equation. The approach substantially reduces the search space, achieves higher recovery rates than the baselines on the Nguyen and Jin SR benchmarks, and recovers accurate governing equations for gene regulatory networks, outperforming baselines in both accuracy and efficiency. This work advances scalable, interpretable modeling for scientific discovery; the authors state that the source code will be released upon publication.
Abstract
Symbolic regression (SR) is a powerful technique for discovering analytical mathematical expressions from data, and it is widely applied in the natural sciences because its results are interpretable. However, existing methods face scalability issues when dealing with complex equations involving multiple variables. To address this challenge, we propose ScaleSR, a scalable symbolic regression model that leverages control variables to enhance both accuracy and scalability. The core idea is to decompose multi-variable symbolic regression into a set of single-variable SR problems, which are then combined in a bottom-up manner. The proposed method is a four-step process. First, we learn a data generator from observed data using deep neural networks (DNNs). Second, the data generator is used to generate samples for a certain variable by controlling the input variables. Third, single-variable symbolic regression is applied to estimate the corresponding mathematical expression. Finally, we repeat steps 2 and 3, adding variables one by one until all variables are included. We evaluate the performance of our method on multiple benchmark datasets. Experimental results demonstrate that the proposed ScaleSR significantly outperforms state-of-the-art baselines in discovering mathematical expressions with multiple variables. Moreover, it can substantially reduce the search space for symbolic regression. The source code will be made publicly available upon publication.
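As an illustration of steps 2 and 3 only, the following minimal Python sketch shows control-variable sampling followed by a toy single-variable fit. It is not the ScaleSR implementation, and several pieces are assumptions made for the example: `data_generator` is a known function standing in for the DNN learned in step 1, "controlling" is taken to mean holding the other inputs at a fixed point, and the `CANDIDATES` library with a least-squares fit is a hypothetical stand-in for a real single-variable SR solver. The bottom-up combination of step 4 is not shown.

```python
import math

def data_generator(x):
    """Stand-in for the learned DNN f(x) (assumption): x0^2 + sin(x1)."""
    return x[0] ** 2 + math.sin(x[1])

def control_variable_samples(f, var_index, fixed_point, grid):
    """Vary one input over `grid` while holding the others at `fixed_point`."""
    samples = []
    for value in grid:
        x = list(fixed_point)
        x[var_index] = value
        samples.append((value, f(x)))
    return samples

# Hypothetical candidate library for the toy single-variable search.
CANDIDATES = {
    "x": lambda v: v,
    "x^2": lambda v: v * v,
    "sin(x)": math.sin,
    "cos(x)": math.cos,
    "exp(x)": math.exp,
}

def fit_single_variable(samples):
    """Pick the candidate g with the smallest squared error for y ~ a*g(x) + b."""
    best = None
    n = len(samples)
    ys = [y for _, y in samples]
    y_mean = sum(ys) / n
    for name, g in CANDIDATES.items():
        gs = [g(v) for v, _ in samples]
        g_mean = sum(gs) / n
        var = sum((gi - g_mean) ** 2 for gi in gs)
        if var == 0:
            continue  # candidate is constant on this grid; cannot fit a slope
        # Closed-form least squares for slope a and intercept b.
        a = sum((gi - g_mean) * (yi - y_mean) for gi, yi in zip(gs, ys)) / var
        b = y_mean - a * g_mean
        err = sum((a * gi + b - yi) ** 2 for gi, yi in zip(gs, ys))
        if best is None or err < best[3]:
            best = (name, a, b, err)
    return best

grid = [-2.0 + 0.1 * k for k in range(41)]  # 41 points on [-2, 2]
fixed_point = [0.5, 0.5]                    # values for the controlled inputs
results = {}
for i in range(2):
    samples = control_variable_samples(data_generator, i, fixed_point, grid)
    results[i] = fit_single_variable(samples)
    name, a, b, err = results[i]
    print(f"x{i}: y ~ {a:.3f} * {name} + {b:.3f} (squared error {err:.2e})")
```

In this toy setting, each one-dimensional slice exposes the dependence on a single variable, with the contribution of the controlled variables absorbed into the intercept `b`. That is the intuition behind the reduced search space: each fit searches over univariate forms rather than over all multivariate expressions at once.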
