Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

TL;DR

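SymLang (Symmetry-constrained Language-guided equation discovery) is a unified framework that combines three previously separate ideas: typed symmetry-constrained grammars, language-model-guided program synthesis, and MDL-regularized Bayesian model selection with block-bootstrap stability analysis. It provides a principled pathway from raw data to interpretable, physically auditable symbolic laws.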

Abstract

Discovering compact governing equations from experimental observations is one of the defining objectives of quantitative science, yet practical discovery pipelines routinely fail when measurements are noisy, relevant state variables are unobserved, or multiple symbolic structures explain the data equally well within statistical uncertainty. Here we introduce SymLang (Symmetry-constrained Language-guided equation discovery), a unified framework that brings together three previously separate ideas: (i) typed symmetry-constrained grammars that encode dimensional analysis, group-theoretic invariance, and parity constraints as hard production rules, eliminating on average 71.3% of candidate expression trees before any fitting; (ii) language-model-guided program synthesis in which a fine-tuned 7B-parameter proposer, conditioned on interpretable data descriptors, efficiently navigates the constrained search space; and (iii) MDL-regularized Bayesian model selection coupled with block-bootstrap stability analysis that quantifies structural uncertainty rather than committing to a single best equation. Across 133 dynamical systems spanning classical mechanics, electrodynamics, thermodynamics, population dynamics, and nonlinear oscillators, SymLang achieves an exact structural recovery rate of 83.7% under 10% observational noise (a 22.4 percentage-point improvement over the next-best baseline) while reducing out-of-distribution extrapolation error by 61% and nearly eliminating conservation-law violations ($3.1 \times 10^{-3}$ vs. $187.3 \times 10^{-3}$ physical drift for the closest competitor). In all tested regimes the framework correctly identifies structural degeneracy, reporting it explicitly rather than returning a confidently wrong single equation. The framework is fully open-source and reproducible, providing a principled pathway from raw data to interpretable, physically auditable symbolic laws.
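
The block-bootstrap stability analysis in (iii) can be pictured with a small sketch. It is not the SymLang implementation: it assumes a moving-block bootstrap, uses a BIC-style score as a stand-in for the MDL criterion, and the function names (`block_bootstrap_indices`, `bic_score`, `selection_stability`), the candidate set, and the synthetic data are all hypothetical.

```python
import numpy as np

def block_bootstrap_indices(n, block_len, rng):
    """Moving-block bootstrap: glue random contiguous blocks together, then trim to n."""
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return idx[:n]

def bic_score(X, y):
    """BIC-style score (lower is better); an assumed stand-in for an MDL criterion."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = max(float(np.sum((y - X @ coef) ** 2)), 1e-12)
    n, k = X.shape
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

def selection_stability(x, y, candidates, n_boot=200, block_len=20, seed=0):
    """Fraction of bootstrap resamples in which each candidate structure scores best."""
    rng = np.random.default_rng(seed)
    wins = {name: 0 for name in candidates}
    for _ in range(n_boot):
        idx = block_bootstrap_indices(len(y), block_len, rng)
        xb, yb = x[idx], y[idx]
        scores = {name: bic_score(build(xb), yb) for name, build in candidates.items()}
        wins[min(scores, key=scores.get)] += 1
    return {name: w / n_boot for name, w in wins.items()}

# Hypothetical toy data: the true structure is y = 2*x plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 200)
y = 2.0 * x + 0.1 * rng.normal(size=x.size)

# Each candidate maps x to a design matrix for one symbolic structure.
candidates = {
    "a*x": lambda x: np.column_stack([x]),
    "a*x**2": lambda x: np.column_stack([x ** 2]),
    "a*x + b*x**3": lambda x: np.column_stack([x, x ** 3]),
}

stability = selection_stability(x, y, candidates)
print(stability)
```

Resampling contiguous blocks rather than individual points keeps short-range temporal correlation intact within each block, which is the usual reason for preferring a block bootstrap on time-series data. A structure that wins in only a modest fraction of resamples would be a candidate for being reported as uncertain rather than as the single answer.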

Paper Structure

This paper contains 13 sections, 1 theorem, 12 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Let $\mathcal{G}_0$ be the unconstrained grammar and $\mathcal{G}_c$ the type-consistent grammar over the same operator set. If $f$ satisfies $k$ independent type constraints, then $|\mathcal{L}(\mathcal{G}_c,\ell)|\le C^{-k}|\mathcal{L}(\mathcal{G}_0,\ell)|$ for a constant $C>1$ depending on the op
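
As a purely illustrative reading of the bound, take the hypothetical values $C = 1.5$ and $k = 3$ (neither is taken from the paper, and the statement above is truncated before $C$ is fully characterized). The ratio of constrained to unconstrained candidates would then satisfy

$$
\frac{|\mathcal{L}(\mathcal{G}_c,\ell)|}{|\mathcal{L}(\mathcal{G}_0,\ell)|} \le C^{-k} = 1.5^{-3} = \frac{1}{3.375} \approx 0.296,
$$

so under these assumed values at least about 70% of the candidates counted by $|\mathcal{L}(\mathcal{G}_0,\ell)|$ would be excluded. Each additional independent constraint shrinks the bound by a further factor of $C$, which is why the reduction is geometric in $k$.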

Figures (4)

  • Figure 1: SymLang pipeline. Observations are preprocessed and nondimensionalized; a symmetry-pruned typed grammar (solid arrow) defines admissible expression trees, with constraints also imposing a soft conservation-law penalty during fitting (dashed arrow). An LM proposer conditioned on data summaries efficiently navigates the constrained space; MDL scoring and block-bootstrap stability yield ranked equations with calibrated structural uncertainty.
  • Figure 2: (a) Sample efficiency. Exact structural recovery (%) versus observed time steps (10% noise, 133-system benchmark, 20 seeds). Dashed verticals mark the 80% threshold crossings for SymLang (${\approx}4.8$k steps) and PySR (${\approx}19$k steps); the double arrow quantifies the $4\times$ sample advantage of grammar-constrained search. (b) Structure uncertainty regimes. Model weight distributions for three representative systems. (b1) Identifiable: $>$96% weight mass on $e_1$, high bootstrap stability. (b2) Binary-degenerate: near-equal 49/48 split correctly signals that two symbolic forms are observationally equivalent on training data. (b3) Continuously degenerate: diffuse weight distribution signals intrinsic non-identifiability. Only SymLang produces these diagnostics; all baselines return a single point estimate with no uncertainty.
  • Figure 3: Recovery rate vs. noise level. Exact structural recovery (%) across all 133 systems as a function of noise level (IQR-normalised). SymLang maintains the largest absolute margin over all baselines across the full noise range tested. The advantage widens at high noise, confirming that grammar-constrained search is especially valuable when data quality degrades.
  • Figure 4: Per-domain exact structural recovery at 10% noise. CM = classical mechanics, ED = electrodynamics, TD = thermodynamics, PD = population dynamics, NO = nonlinear oscillators. SymLang leads in every domain; the largest gains occur in CM (energy conservation pruning) and ED (dimensional constraints on circuit equations).
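
The three uncertainty regimes described for Figure 2(b) can be sketched as a rule applied to normalized model weights. The following is a minimal illustration, assuming that weights are obtained by exponentiating negative description lengths (in nats) and normalizing; that weighting, the thresholds `dominant` and `pair`, and the function names are assumptions made here, not details confirmed by the source.

```python
import math

def model_weights(description_lengths):
    """Turn description lengths (nats, lower is better) into normalized weights."""
    best = min(description_lengths.values())
    # Subtract the minimum before exponentiating for numerical stability.
    unnorm = {name: math.exp(-(dl - best)) for name, dl in description_lengths.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}

def regime(weights, dominant=0.9, pair=0.9):
    """Label a weight distribution; both thresholds are hypothetical."""
    ranked = sorted(weights.values(), reverse=True)
    if ranked[0] >= dominant:
        return "identifiable"
    if len(ranked) >= 2 and ranked[0] + ranked[1] >= pair:
        return "binary-degenerate"
    return "continuously degenerate"

# Hypothetical description lengths for three systems.
clear_winner = model_weights({"e1": 100.0, "e2": 105.0, "e3": 110.0})
two_way_tie = model_weights({"e1": 100.0, "e2": 100.02, "e3": 104.0})
diffuse = model_weights({f"e{i}": 100.0 for i in range(1, 6)})

for label, w in [("clear", clear_winner), ("tie", two_way_tie), ("diffuse", diffuse)]:
    print(label, regime(w))
```

Reporting the label together with the full weight vector, rather than only the top-ranked equation, is what distinguishes this kind of output from a single point estimate.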

Theorems & Definitions (2)

  • Definition 1: Type-consistent grammar
  • Proposition 1: Pruning bound
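
To make the idea of a type-consistent grammar and its pruning effect concrete, the sketch below enumerates small expression trees over three variables and counts how many survive a dimensional check. It is a toy under stated assumptions: dimensions are represented as (mass, length, time) exponent tuples, the variables `x`, `v`, `t` and the four operators are chosen for illustration, and ill-typed trees are filtered after generation purely so they can be counted, whereas a typed grammar would never produce them in the first place.

```python
from itertools import product

# Hypothetical variables with (mass, length, time) exponents.
LEAVES = {"x": (0, 1, 0), "v": (0, 1, -1), "t": (0, 0, 1)}
OPS = ["+", "-", "*", "/"]

def combine(op, da, db):
    """Dimension of (a op b), or None when the operation is dimensionally invalid."""
    if op in ("+", "-"):
        return da if da == db else None  # addition needs matching dimensions
    sign = 1 if op == "*" else -1
    return tuple(a + sign * b for a, b in zip(da, db))

def enumerate_exprs(depth):
    """All syntactic trees up to `depth`, paired with their dimension (None if ill-typed)."""
    exprs = list(LEAVES.items())
    for _ in range(depth):
        prev = exprs
        exprs = list(LEAVES.items())
        for (ea, da), op, (eb, db) in product(prev, OPS, prev):
            dim = None if da is None or db is None else combine(op, da, db)
            exprs.append((f"({ea} {op} {eb})", dim))
    return exprs

target = (0, 1, -1)  # the output must have the dimension of a velocity

all_exprs = enumerate_exprs(depth=1)
well_typed = [e for e, d in all_exprs if d is not None]
on_target = [e for e, d in all_exprs if d == target]

print(len(all_exprs), len(well_typed), len(on_target))
print(on_target)
```

Even at depth 1 most syntactic candidates fail either the consistency check or the target-dimension check, and the gap grows quickly with depth because an ill-typed subtree invalidates every tree that contains it.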