A Neural Symbolic Model for Space Physics

Jie Ying, Haowei Lin, Chao Yue, Yajie Chen, Chao Xiao, Quanqi Shi, Yitao Liang, Shing-Tung Yau, Yuan Zhou, Jianzhu Ma

TL;DR

PhyE2E is a neural-symbolic framework for automated discovery of physical laws from observational data. It combines LLM-synthesized physics formulas, transformer-based end-to-end formula regression, and a Hessian-guided divide-and-conquer decomposition, followed by MCTS and GP refinement. The method achieves state-of-the-art symbolic accuracy and unit consistency on both synthetic and AI-Feynman datasets, and is applied to diverse real space-physics problems, including sunspot numbers, plasma pressure, solar differential rotation, emission-line contributions, and lunar-tide signals, often yielding substantially simpler, interpretable formulas. A key advance is the explicit integration of physical priors, especially units, into the model and its outputs, enabling unit-consistent formulas without heavy retuning. The work demonstrates generalization to long-term solar cycles and multiple space-physics phenomena, and provides data and code to enable broader application of neural-symbolic regression to scientific discovery.

Abstract

In this study, we present a new AI model, termed PhyE2E, to discover physical formulas through symbolic regression. PhyE2E simplifies symbolic regression by decomposing it into sub-problems using the second-order derivatives of an oracle neural network, and employs a transformer model to translate data into symbolic formulas in an end-to-end manner. The resulting formulas are refined through Monte-Carlo Tree Search and Genetic Programming. We leverage a large language model to synthesize extensive symbolic expressions resembling real physics, and train the model to recover these formulas directly from data. A comprehensive evaluation reveals that PhyE2E outperforms existing state-of-the-art approaches, delivering superior symbolic accuracy, precision in data fitting, and consistency in physical units. We deployed PhyE2E to five applications in space physics, including the prediction of sunspot numbers, solar rotational angular velocity, emission line contribution functions, near-Earth plasma pressure, and lunar-tide plasma signals. The physical formulas generated by AI demonstrate a high degree of accuracy in fitting the experimental data from satellites and astronomical telescopes. We have successfully upgraded the formula proposed by NASA in 1993 regarding solar activity and, for the first time, provided an explanation for the long cycle of solar activity in an explicit form. We also found that the decay of near-Earth plasma pressure is proportional to $r^2$, where $r$ is the distance to Earth, and subsequent mathematical derivations are consistent with satellite data from another independent study. Moreover, we found physical formulas that can describe the relationships between emission lines in the extreme ultraviolet spectrum of the Sun, temperatures, electron densities, and magnetic fields. The formula obtained is consistent with the properties that physicists had previously hypothesized it should possess.
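
The decomposition by second-order derivatives mentioned in the abstract can be illustrated with a minimal sketch. It covers only the simplest special case, additive separability: if $f$ can be written as a sum of a term without $x_j$ and a term without $x_i$, the mixed partial $\partial^2 f / \partial x_i \partial x_j$ vanishes everywhere. The sketch is not the PhyE2E implementation. It assumes an analytic test function in place of the oracle neural network, central finite differences in place of automatic differentiation, and hypothetical names (`mixed_partial`, `additively_separable`, `example_formula`).

```python
import numpy as np

def mixed_partial(f, x, i, j, h=1e-3):
    """Central finite-difference estimate of d^2 f / (dx_i dx_j) at point x."""
    x = np.asarray(x, dtype=float)
    ei = np.zeros_like(x)
    ej = np.zeros_like(x)
    ei[i] = h
    ej[j] = h
    return (f(x + ei + ej) - f(x + ei - ej)
            - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)

def additively_separable(f, i, j, n_vars, n_samples=200, tol=1e-4, seed=0):
    """Heuristic test: the mixed partial is numerically zero at all sampled points."""
    rng = np.random.default_rng(seed)
    # Sampling range is an arbitrary choice for this illustration.
    points = rng.uniform(0.5, 2.0, size=(n_samples, n_vars))
    worst = max(abs(mixed_partial(f, p, i, j)) for p in points)
    return bool(worst < tol)

def example_formula(x):
    """Hypothetical target: x0 enters additively, while x1 and x2 interact."""
    return x[0] ** 2 + np.sin(x[1]) * x[2]

if __name__ == "__main__":
    print(additively_separable(example_formula, 0, 1, n_vars=3))
    print(additively_separable(example_formula, 1, 2, n_vars=3))
```

In this toy case the pair $(x_0, x_1)$ passes the test while $(x_1, x_2)$ does not, so the regression problem could be split into a sub-problem in $x_0$ and a sub-problem in $(x_1, x_2)$. The sampling range, step size, and tolerance are arbitrary choices made for the illustration.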

Paper Structure

This paper contains 55 sections, 5 theorems, 30 equations, 10 figures, 20 tables, 1 algorithm.

Key Result

Lemma 1

Let the uni-variate operator $\sigma : \mathbb{R} \to \mathbb{R}$ and the target formula $f: \mathbb{R}^n \to \mathbb{R}$ be twice differentiable. Suppose $\sigma$ is strictly monotonic, then two features $i, j \in \{1, 2, \dots, n\}$ are $\sigma$-separable if and only if for all $\bm{x} \in \mathbb

Figures (10)

  • Figure 1: The overall PhyE2E framework. Top. The training dataset was augmented with a large-scale synthetic dataset generated by a large language model. Middle. A variable interaction technique was integrated to decompose the original symbolic regression problem into simpler sub-problems, referred to as Divide-and-Conquer (D&C). An end-to-end model was trained to predict the target formula using observed data points and prior physical knowledge (referred to as "physical priors"). Bottom. A Monte Carlo Tree Search (MCTS) module was adopted to refine the generated formulas, using a context-free grammar pool that includes atomic formulas and the end-to-end generated formula.
  • Figure 2: Performance on the synthetic and AI Feynman datasets. a, Comparison between the formulas generated from LLaMa2 and the Feynman dataset. The distance between the distributions of different properties of the two sets of formulas is measured using the Jensen-Shannon divergence (D$_{\text{JS}}$). b,c, Evaluation results for symbolic regression methods on the test set of the synthetic dataset and the AI Feynman dataset, respectively. Data are presented as mean values ± SEM (n=5 individual trials for each baseline). d, Evaluation results on formulas with different complexity (upper panels) and different difficulties (bottom panels) on the synthetic and AI Feynman datasets. The bar plots represent mean values ± SEM (n=5 individual trials for each baseline).
  • Figure 3: Performance of sunspot intensity prediction. a, Sunspot variation over time. b, Variations in sunspot number (SSN) observed through telescopes from 1755 to 2020 and the formula derived by Hathaway et al., 1994. c, The PhyE2E formula and the variations in SSN yielded by the formula from 1855 to 1976 (top). The formulas generated by other baseline models and the variations in SSN yielded by these formulas from 1855 to 1976 (bottom). d, Avg-R (left) and Multi-R (right) on the test data from 1976 to 2019 for different baseline models. e, Solar modulation level and smoothed SSN from different baseline models over a longer time frame from 980 to 1976. f, Pearson correlation between the SSN observed by telescopes, SSN predicted by the generated formulas, and solar modulation level from 980 to 1932.
  • Figure 4: Performance of plasma sheet pressure prediction and solar differential rotation prediction. a, The distribution of the near-Earth magnetosphere and plasma sheet. b, Symbolic formulas of Wang et al., 2013 and PhyE2E. c, Average mean square error (MSE) (left) and complexity (right) of the compared models when using data from different radii. d, Instrumental observations and formula predictions for plasma sheet pressure using different models. e, Predictions for plasma sheet pressure by PhyE2E using data from different radii. f, Solar rotation varies with latitude, which stretches and twists magnetic field lines. g, MSE and complexity of different models using different numbers of training data. h, Predictions from Snodgrass et al., 1993 and PhyE2E across all latitudes. i, Predictions of solar atmosphere, using data from various spectral lines in the photosphere and the chromosphere. j, PhyE2E predicts consistent formulas with high robustness across various spectral lines.
  • Figure 5: Performance of emission-line contribution function predictions and plasma-layer lunar tide signal predictions. a, Emission lines in the extreme ultraviolet spectrum of the Sun. b, Average MSE for Fe X 174 and Fe X 175 (left), MSE of the ratio between the two emission lines (middle), and the complexity (right) of the formulas generated by the compared models. c, Instrument-measured contribution function and PhyE2E predictions for Fe X 174 and Fe X 175. d, Instrument-measured ratio of the two emission lines and PhyE2E predictions. e, Tidal radial electric fields influence the Earth's magnetospheric electric fields. f, MSE and complexity of the compared models. g, Instrument-measured radial electric field ($E_r$) (left) and PhyE2E predictions for dayside and nightside of the Earth.
  • ...and 5 more figures

Theorems & Definitions (12)

  • Definition 1
  • Lemma 1
  • Definition 2
  • Lemma 2
  • Theorem 3
  • proof : Proof of Lemma 1
  • proof : Proof of Lemma 2
  • Lemma 4
  • proof
  • Lemma 5
  • ...and 2 more