Symbolic Foundation Regressor on Complex Networks

Weiting Liu, Jiaxu Cui, Jiao Hu, En Wang, Bo Yang

TL;DR

This work introduces a pre-trained symbolic foundation regressor that compresses complex data with numerous interacting variables into interpretable physical representations. It offers a foundational solution for revealing the hidden mechanisms behind changes in complex phenomena, enhancing interpretability, and inspiring further scientific discoveries.

Abstract

In science, we are interested not only in forecasting but also in understanding how predictions are made, specifically what the interpretable underlying model looks like. Data-driven machine learning can significantly streamline the complex and time-consuming manual process of discovering scientific laws, helping us gain insights into fundamental issues in modern science. In this work, we introduce a pre-trained symbolic foundation regressor that can effectively compress complex data with numerous interacting variables while producing interpretable physical representations. Our model has been rigorously tested on non-network symbolic regression, symbolic regression on complex networks, and the inference of network dynamics across various domains, including physics, biochemistry, ecology, and epidemiology. The results indicate a remarkable improvement in equation inference efficiency, three times that of baseline approaches, while maintaining accurate predictions. Furthermore, we apply our model to uncover more intuitive laws of interaction transmission from global epidemic outbreak data, achieving optimal data fitting. This model extends the application boundary of pre-trained symbolic regression models to complex networks, and we believe it provides a foundational solution for revealing the hidden mechanisms behind changes in complex phenomena, enhancing interpretability, and inspiring further scientific discoveries.

Paper Structure

This paper contains 25 sections, 34 figures, 19 tables, 1 algorithm.

Figures (34)

  • Figure 1: a. Human beings can condense observations into scientific laws, and then use these compressed laws to analyze and regulate various systems. We attempt to imitate this process of compressing interpretable physical representations in human learning through machine learning. b. The overall process of our Symbolic Foundation Regressor (SFR), including the generation of massive high-quality synthetic data-equation pairs, the dual-branch model architecture, and the pre-training process. c. After pre-training the SFR, we can effectively derive the target equation for an unseen downstream task through a single forward propagation. Additionally, data pre-processing and equation post-processing are included to enhance the accuracy of the recovered equation.
  • Figure 2: Results on classical non-network symbolic regression tasks. a. Comparison of the execution accuracy ($R^2$, $Close_{0.001}$) of various methods (PySR, SINDy, NeSymRes, E2E, and Ours) on two datasets (AI-Feynman and USE-F). b. Comparison of the execution accuracy of various methods for equations with different lengths in USE-F. c. The influence of the number of test data points on the results. d. A physical equation from the AI-Feynman dataset, used for regression analysis, that describes the relationship between the modulus of rigidity $G$, modulus of elasticity $E$, and Poisson's ratio $\mu$ in materials science. Our SFR can reconstruct the equation closest to the ground truth with the same amount of data, demonstrating the applicability and potential of our model in classical symbolic regression tasks.
  • Figure 3: Results of symbolic regression on complex networks. a. Comparison of the performance ($R^2$ and $Close_p$) of our model on USE with different topologies (Grid, Random, Power Law, Small World, and Community). b. Comparison of the performance of our model on equations with different lengths and dimensions. c. The impact of the number of test data points on equations with different complexity. d. Data representations (denoted as $h$) generated by equations with various characteristics, visualized through projection using t-SNE. e. A specific example of symbolic regression from USE, demonstrating the ability of our model to regress high-precision equations on complex networks from local observations.
  • Figure 4: Results on inferring interpretable network dynamics. a. Comparison of the performance ($Close_p$, $R^2$) for reconstructing dynamics from six scenarios, including Epidemic (Epi), Biochemical (Bio), Lotka-Volterra (LV), Mutualistic interaction (Mutu), Heat diffusion (Heat), and Gene regulatory (Gene) dynamics. b. Comparison of the average execution time across all dynamics for various methods. c. The MAPE (Mean Absolute Percentage Error) between the predictions produced by the discovered governing equations and the ground truth in the LV scenario with four communities, and a comparison of state prediction curves on selected nodes, where $T_R$ and $T_P$ are the termination times of IN-Domain and OUT-Domain, respectively. d. Comparison of governing equations inferred by various methods.
  • Figure 5: Results on heterogeneous epidemic transmission in communities. a. Four heterogeneous transmission equations obtained by assigning different recovery rates ($\delta$) in the epidemic equation, where $x_{i,0}:=I_{i}$ means the probability of an individual $i$ being susceptible. b. Comparison of the state prediction curves generated by the governing equations inferred from observations at sampling nodes within each community. Our SFR successfully recovers the phenomena exhibited by the heterogeneous transmission equations.
  • ...and 29 more figures
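
Several captions above report $R^2$ and MAPE. As a minimal sketch, assuming the standard textbook definitions (this listing does not state the paper's exact formulas, and $Close_p$ is left out because its definition is not given here), the two metrics can be computed as follows. The function names and sample values are illustrative only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (standard definition, assumed)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (standard definition, assumed).

    Requires every ground-truth value to be non-zero.
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Illustrative values, not taken from the paper.
    truth = [1.0, 2.0, 4.0, 8.0]
    prediction = [1.1, 1.9, 4.2, 7.6]
    print("R^2 :", r2_score(truth, prediction))
    print("MAPE:", mape(truth, prediction))
```

A perfect prediction gives $R^2 = 1$ and MAPE $= 0$, while predicting the mean of the ground truth everywhere gives $R^2 = 0$.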