Active Learning in Symbolic Regression with Physical Constraints

Jorge Medina, Andrew D. White

TL;DR

This work addresses data efficiency in symbolic regression (SR) by combining active learning with soft physical constraints. The authors implement a query-by-committee strategy in which the Pareto frontier of candidate equations serves as the committee, and they embed physical knowledge as regularization terms, enabling rediscovery of known equations with far less data than traditional SR. Across a gravitational-law test, the Feynman benchmark equations, noise-robustness experiments, and a Shewanella growth case study, the approach reduces data needs, improves interpretability, and yields physically meaningful relationships, including Gompertz-like growth parameterizations. The framework offers a practical, physics-informed route to data-efficient equation discovery, with broad applicability and accessible code and data.

Abstract

Evolutionary symbolic regression (SR) fits a symbolic equation to data, which gives a concise interpretable model. We explore using SR as a method to propose which data to gather in an active learning setting with physical constraints. SR with active learning proposes which experiments to do next. Active learning is done with query by committee, where the Pareto frontier of equations is the committee. The physical constraints improve proposed equations in very low data settings. These approaches reduce the data required for SR and achieve state-of-the-art results in the data required to rediscover known equations.
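
As a rough illustration of what a soft physical constraint can look like, the sketch below adds a penalty term, weighted by a strength `lam`, to a mean-squared-error fit. The choice of an exchange-symmetry constraint, the penalty form, and the names `symmetry_penalty` and `constrained_loss` are assumptions made for this example; they are not taken from the paper's code.

```python
import numpy as np

def symmetry_penalty(f, X):
    """Mean squared violation of f(x1, x2) == f(x2, x1) on the sample X.

    Hypothetical penalty chosen for illustration only.
    """
    swapped = X[:, ::-1]  # exchange the two input columns
    return float(np.mean((f(X) - f(swapped)) ** 2))

def constrained_loss(f, X, y, lam=1.0):
    """Data misfit plus a soft constraint weighted by lam (assumed form)."""
    mse = float(np.mean((f(X) - y) ** 2))
    return mse + lam * symmetry_penalty(f, X)

# Toy data from a target that is symmetric in its two inputs.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(8, 2))
y = X[:, 0] * X[:, 1]

symmetric = lambda X: X[:, 0] * X[:, 1]   # candidate that respects the symmetry
asymmetric = lambda X: X[:, 0] ** 2       # candidate that violates it

loss_sym = constrained_loss(symmetric, X, y, lam=1.0)
loss_asym_nc = constrained_loss(asymmetric, X, y, lam=0.0)  # no constraint
loss_asym = constrained_loss(asymmetric, X, y, lam=1.0)
```

With `lam=0.0` the loss reduces to the plain data misfit, and increasing `lam` pushes the search away from candidates that break the assumed symmetry even when few data points are available.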

Paper Structure (19 sections, 18 equations, 13 figures, 5 tables, 1 algorithm)

Figures (13)

  • Figure 1: Query By Committee Depiction. 1) Symbolic Regression with physical constraints 2) outputs a Pareto frontier of equations that act as expert models, and measures disagreement from unlabeled data to 3) select the most informative point to 4) label and iterate.
  • Figure 2: Rediscovery of Gravitational Law with different types of constraints. $\lambda$ regulates the physical constraint strength during optimization. NC: No Constraint.
  • Figure 3: Evaluation of disagreement measures. a) Disagreement analysis of 15 equations from PySR and 10,000 data points, resulting in a correlation coefficient of 0.91. Similar points exhibit maximum disagreement. b) Rediscovery performance for $f = (1 / 2\pi)^{1/2} \exp\left(-((x_1 - x_2) / \sigma)^2 / 2\right)$ versus the number of data points when new labeled points are added through active learning or at random.
  • Figure 4: Number of samples needed for rediscovery of non-trivial equations of Feynman dataset with (green) and without (blue) constraints. The inset chart shows that from the twelve tested equations, PySR with QBC outperformed AIFeynman in eight of them. Arrows show the direction of improvement.
  • Figure 5: Illustrating the Effects of Noise on Rediscovery: A Comparison between Constrained and Unconstrained Optimizations. The figure presents results for the expression $\sqrt{x_{1}^{2} + x_{2}^{2} - 2x_{1}x_{2}\cos(\theta_{1} - \theta_{2})}$. P-values are 0.061 for noiseless conditions and 0.443 for a noise level of 0.01. The joint p-value for comparing unconstrained and symmetry-constrained optimization is 0.127.
  • ...and 8 more figures
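
The selection step of the query-by-committee loop described in Figure 1 can be sketched as follows. This is a minimal sketch under stated assumptions: the committee is a hand-written list of candidate functions standing in for a Pareto frontier, disagreement is measured as the standard deviation of the committee's predictions, and `qbc_select` is a hypothetical helper name. The disagreement measure used in the paper may differ.

```python
import numpy as np

def qbc_select(committee, pool):
    """Return the index of the pool point where the committee disagrees most.

    Disagreement is taken here as the standard deviation of the members'
    predictions at each candidate point (an assumption for this sketch).
    """
    preds = np.stack([f(pool) for f in committee])  # shape (n_models, n_pool)
    disagreement = preds.std(axis=0)
    return int(np.argmax(disagreement)), disagreement

# Stand-in committee: in practice these would be the equations on the
# Pareto frontier returned by the symbolic regression run.
committee = [
    lambda x: x,
    lambda x: x ** 2,
    lambda x: np.sin(x) + x,
]

# Unlabeled candidate inputs that could be measured next.
pool = np.linspace(0.0, 3.0, 31)

idx, disagreement = qbc_select(committee, pool)
next_point = pool[idx]  # the point to label before refitting
```

In a full loop, the selected point would be labeled by an experiment or simulation, appended to the training set, and the symbolic regression would be rerun to produce a new committee.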