Prior-Guided Symbolic Regression: Towards Scientific Consistency in Equation Discovery

Jing Xiao, Xinhai Chen, Jiaming Peng, Qinglin Wang, Menghan Jia, Zhiquan Lai, Guangping Yu, Dongsheng Li, Tiejun Li, Jie Liu

TL;DR

Symbolic regression often yields pseudo-equations that fit observed data yet violate fundamental scientific principles. The authors propose PG-SR, a prior-guided SR framework with a three-stage pipeline (warm-up, evolution, refinement), an explicit prior constraint checker, and a Prior Annealing Constrained Evaluation (PACE) mechanism that steers discovery toward scientifically consistent regions. They prove that restricting the hypothesis space to prior-aligned subspaces reduces its Rademacher complexity, which yields tighter generalization bounds and a formal guarantee against pseudo-equations. Empirically, PG-SR outperforms state-of-the-art baselines across diverse domains and remains robust to varying prior quality, noise, and data scarcity, recovering ground-truth-like dynamics in several cases. The work advances interpretable, scientifically grounded equation discovery and points toward automating prior constraint synthesis via learning-based priors.

Abstract

Symbolic Regression (SR) aims to discover interpretable equations from observational data, with the potential to reveal underlying principles behind natural phenomena. However, existing approaches often fall into the Pseudo-Equation Trap: producing equations that fit observations well but remain inconsistent with fundamental scientific principles. A key reason is that these approaches are dominated by empirical risk minimization, lacking explicit constraints to ensure scientific consistency. To bridge this gap, we propose PG-SR, a prior-guided SR framework built upon a three-stage pipeline consisting of warm-up, evolution, and refinement. Throughout the pipeline, PG-SR introduces a prior constraint checker that explicitly encodes domain priors as executable constraint programs, and employs a Prior Annealing Constrained Evaluation (PACE) mechanism during the evolution stage to progressively steer discovery toward scientifically consistent regions. Theoretically, we prove that PG-SR reduces the Rademacher complexity of the hypothesis space, yielding tighter generalization bounds and establishing a guarantee against pseudo-equations. Experimentally, PG-SR outperforms state-of-the-art baselines across diverse domains, maintaining robustness to varying prior quality, noisy data, and data scarcity.
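To make the idea of priors as "executable constraint programs" combined with an annealed evaluation more concrete, the following is a minimal sketch and not the authors' implementation. The constraint functions, the linear ramp in `annealed_weight`, the additive penalty in `annealed_score`, and all names are assumptions introduced here for illustration; the actual PACE formulation is defined in the paper.

```python
import math

# Each prior is a small program that takes a candidate equation f and returns
# a non-negative violation measure (0.0 means the prior is satisfied).
# These two priors are hypothetical examples, not taken from the paper.
def non_negative(grid):
    def check(f):
        return sum(max(0.0, -f(x)) for x in grid) / len(grid)
    return check

def monotone_decreasing(grid):
    def check(f):
        pairs = list(zip(grid, grid[1:]))
        return sum(max(0.0, f(b) - f(a)) for a, b in pairs) / len(pairs)
    return check

def mse(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def annealed_weight(step, total_steps, max_weight=10.0):
    # Assumed schedule: the prior penalty ramps linearly from 0 to max_weight.
    return max_weight * min(1.0, step / max(1, total_steps))

def annealed_score(f, xs, ys, constraints, step, total_steps):
    # Lower is better: data fit plus a prior penalty that grows over time.
    violation = sum(c(f) for c in constraints)
    return mse(f, xs, ys) + annealed_weight(step, total_steps) * violation

# Toy setup: observations of a decaying quantity on [0, 1],
# with priors checked on the wider range [0, 5].
train_x = [i * 0.1 for i in range(11)]
train_y = [math.exp(-x) for x in train_x]
prior_grid = [i * 0.5 for i in range(11)]
priors = [non_negative(prior_grid), monotone_decreasing(prior_grid)]

def consistent(x):
    return math.exp(-x)

def pseudo(x):
    # Cubic Taylor polynomial: close to exp(-x) on [0, 1], negative for large x.
    return 1.0 - x + x ** 2 / 2.0 - x ** 3 / 6.0

early = annealed_score(pseudo, train_x, train_y, priors, step=0, total_steps=10)
late = annealed_score(pseudo, train_x, train_y, priors, step=10, total_steps=10)
```

In this toy setup the polynomial candidate is judged on data fit alone at step 0, while at the final step its violation of the non-negativity prior outside the training range dominates its score. A gradual ramp of this kind lets early search explore freely before consistency is enforced strictly.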

Paper Structure

This paper contains 53 sections, 3 theorems, 35 equations, 9 figures, 4 tables.

Key Result

Lemma 3.2

Given the hypothesis space $\mathcal{H}$, assume the loss function is $\lambda$-Lipschitz and bounded by $M$. For any $\delta > 0$, with probability at least $1-\delta$, the expected risk $R(f)$ of every $f \in \mathcal{H}$ is upper-bounded by its empirical risk plus terms that depend on $\mathcal{R}_N(\mathcal{H})$, the Rademacher complexity of $\mathcal{H}$, and on $\delta$. (See the Appendix for the detailed formulation.)
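For orientation, a bound of this kind is commonly written in the textbook form sketched below. This is an assumption about the general shape, not a reproduction of the paper's inequality: the constants may differ, and the symbols $\hat{R}(f)$ (empirical risk on $N$ samples) and $\mathcal{H}_{P}$ (a prior-constrained subspace) are notation introduced here. The second display records why restricting the hypothesis space cannot increase the empirical Rademacher complexity: a supremum over a subset is never larger than the supremum over the full set.

```latex
% Textbook-style Rademacher bound; constants are assumed, not taken from the paper.
% \hat{R}(f): empirical risk on N samples (notation assumed here).
R(f) \;\le\; \hat{R}(f) + 2\lambda\,\mathcal{R}_N(\mathcal{H})
      + M\sqrt{\frac{\log(1/\delta)}{2N}}
\qquad \text{for all } f \in \mathcal{H}.

% Monotonicity under restriction, for any subspace \mathcal{H}_{P} \subseteq \mathcal{H}
% (hypothetical notation for a prior-constrained subspace):
\mathcal{R}_N(\mathcal{H}_{P})
= \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{H}_{P}} \frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\Big]
\;\le\;
\mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\Big]
= \mathcal{R}_N(\mathcal{H}).
```

Substituting the smaller complexity term into the first inequality tightens the bound for hypotheses in the subspace, which is the direction of the argument summarized in the TL;DR.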

Figures (9)

  • Figure 1: The Pseudo-Equation Trap. The pseudo-equation (orange) is incorrectly accepted as it fits noisy training data perfectly, despite deviating from the true underlying equation. In contrast, the consistent candidate (blue) is incorrectly rejected because it fits noise poorly, even though it aligns with the true equation.
  • Figure 2: Overview of the PG-SR framework with a three-stage pipeline comprising Warm-up, Evolution, and Refinement. The left panel illustrates the construction of the prior constraint checker.
  • Figure 3: Qualitative comparison of predictive trajectories for PG-SR and baselines on ID (gray) and OOD (colored) regions.
  • Figure 4: OOD generalization of PySR and LLM-SR with and without prior augmentation, compared to PG-SR.
  • Figure 5: PG-SR performance under different prior quality settings: No Prior, Weak Priors, Wrong Priors, and Full Priors.
  • ...and 4 more figures
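The trap described in the Figure 1 caption can be reproduced with a small toy experiment. The sketch below is illustrative only: the linear ground truth, the noise level, the sample points, and the use of a Lagrange interpolant as the "pseudo-equation" are all assumptions chosen here, not the paper's setup.

```python
import random

def true_equation(x):
    # Assumed ground truth for this illustration.
    return 2.0 * x + 1.0

def lagrange_interpolant(xs, ys):
    # Returns the unique polynomial of degree len(xs) - 1 passing through
    # every (x, y) pair, so it reproduces the noisy training targets exactly.
    def f(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            basis = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    basis *= (x - xm) / (xj - xm)
            total += yj * basis
        return total
    return f

def mse(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Noisy in-distribution observations.
train_x = [float(i) for i in range(8)]
train_y = [true_equation(x) + rng.gauss(0.0, 0.5) for x in train_x]

# Noise-free out-of-distribution targets beyond the training range.
ood_x = [8.0, 9.0, 10.0]
ood_y = [true_equation(x) for x in ood_x]

pseudo = lagrange_interpolant(train_x, train_y)

train_err_pseudo = mse(pseudo, train_x, train_y)
train_err_true = mse(true_equation, train_x, train_y)
ood_err_pseudo = mse(pseudo, ood_x, ood_y)
ood_err_true = mse(true_equation, ood_x, ood_y)
```

Ranking the two candidates by training error alone prefers the interpolant, since it matches the noise while the true equation does not. Ranking by error on the out-of-range points reverses that order, which is the failure mode of pure empirical risk minimization that the figure depicts.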

Theorems & Definitions (7)

  • Definition 3.1: Symbolic Hypothesis Space
  • Lemma 3.2: Generalization Bound
  • Corollary 3.3: The Pseudo-Equation Trap
  • Definition 3.4: Prior-Constrained Subspace
  • Proposition 3.5: Consistency-Guaranteed Generalization
  • proof
  • proof