Probabilistic Regular Tree Priors for Scientific Symbolic Reasoning

Tim Schneider, Amin Totounferoush, Wolfgang Nowak, Steffen Staab

TL;DR

This work introduces priors based on probabilistic Regular Tree Expressions (pRTE) to encode scientific knowledge for symbolic regression via formal tree languages, addressing the mismatch between string-based grammars and tree-structured expressions. By combining a tree-structured prior with discrete and continuous parameter priors and a likelihood from noisy data, the authors formulate a Bayesian inference framework that operates with finite-state machines and factor-graph density evaluations. They demonstrate improved generalization on soil sorption isotherms and synthetic hyper-elastic material models, with a better balance between model complexity and predictive accuracy than standard baselines. The approach offers closure under Boolean operations, scalable prior composition, and uncertainty-aware inference, giving a principled way to incorporate domain knowledge into symbolic regression.

Abstract

Symbolic Regression (SR) allows for the discovery of scientific equations from data. To limit the large search space of possible equations, prior knowledge has been expressed in terms of formal grammars that characterize subsets of arbitrary strings. However, there is a mismatch between the context-free grammars required to express the set of syntactically correct equations, which lack closure properties, and the tree structure of the equations themselves. Our contributions are to (i) compactly express experts' prior beliefs about which equations are more likely to be expected by probabilistic Regular Tree Expressions (pRTE), and (ii) adapt Bayesian inference to make such priors, encoded as finite state machines, efficiently available for symbolic regression. Our scientific case studies show the effectiveness of this approach in soil science, where it finds sorption isotherms, and in modeling hyper-elastic materials.

Paper Structure (50 sections, 37 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Scientists typically have data and prior knowledge. Our Bayesian Inference (\ref{sec:method:bayesian_inference}) requires samples (\ref{sec:method:prior_sampling}) and density evaluations (\ref{sec:method:prior_evaluation}) of the latter and yields a posterior distribution over expressions that fit the data and are consistent with the prior knowledge.
  • Figure 2: We represent an expression (\ref{fig:symbolic_example}) in variables $x,y,z$ with a tree $t\in T_\Sigma$ (\ref{fig:symbolic_example2}), where $c \in \Sigma^{(0)}$ marks the positions of the parameters $\theta_c = (7, 12)^T$. To evaluate a regular tree prior $p(t)$, we construct a factor graph (\ref{fig:factor_graph}): for every node in $t$ there are two random variables, $x_i$ and $q_i$. The random variables $x_i$ model the possible ranked symbols for a tree node and are thus observed, e.g. $x_5$ corresponds to $\hbox{-} \in \Sigma^{(0)}$ in $t$, and assignments to the random variable $q_5$ model the possible states the finite-state machine can be in after parsing the tree $t$ up to that symbol. Finally, factors $\phi_j$ express the transition probabilities of the finite-state machine. To determine $p(t)$, we condition on all $x_i$ (i.e. the symbols in the input tree $t$) and marginalize over all possible state assignments to $q_i$ (i.e. runs of the finite-state machine).
  • Figure 3: Particle system: system knowledge defines a prior distribution of potential force functions (\ref{fig:newtonian_example_prior}), which, after inference with training data (\ref{fig:newtonian_example_data}), yields a posterior distribution (\ref{fig:newtonian_example_posterior}). More data and combinations of knowledge (\ref{fig:newtonian_example_result}) help to find the true gravitational force. In (\ref{fig:newtonian_example_data}, \ref{fig:newtonian_example_prior}, \ref{fig:newtonian_example_posterior}), colored paths show the true trajectories of the particles in the corresponding datasets.
  • Figure 4: Example of synthetic sorption isotherm (Langmuir) data, with (\ref{fig:prte_prior}) samples from the prior encoded in the regular expression and (\ref{fig:prte_posterior}) the posterior of equations given data evidence. In contrast, a typical baseline solution (\ref{fig:baseline_results}) yields unphysical predictions on the same data.
  • Figure 5: Marginal inference for $p(q_4)$ in a factor graph (\ref{fig:context_factor_graph}) to re-sample the context $t_{\leq r}$ of a tree $t$ in position $r \in D$ marked by $? \in \Sigma^{(0)}$. With the inferred state $q_4$ a new sub-tree can be grown.
  • ...and 1 more figure
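
The prior evaluation described in the Figure 2 caption (condition on the observed symbols $x_i$, marginalize over the state variables $q_i$) can be sketched as a bottom-up sum-product pass over the tree. This is only a minimal illustration, not the authors' implementation: it assumes a top-down probabilistic tree automaton, and the states `A` and `B`, the rule probabilities, and the names `INIT`, `RULES`, `inside` and `tree_prior` are all hypothetical.

```python
# Hypothetical toy automaton (not from the paper).
# INIT[q]: probability that a run starts in state q at the root.
INIT = {"A": 0.5, "B": 0.5}

# RULES[q]: list of (symbol, child_states, probability).
# For each state the rule probabilities sum to 1.
RULES = {
    "A": [("+", ("A", "B"), 0.5), ("x", (), 0.5)],
    "B": [("+", ("B", "B"), 0.25), ("x", (), 0.25), ("c", (), 0.5)],
}


def inside(tree):
    """Return {state: probability that the state generates this subtree}.

    A tree is a pair (symbol, children); leaves have an empty child list.
    """
    symbol, children = tree
    child_tables = [inside(child) for child in children]
    table = {}
    for state, rules in RULES.items():
        total = 0.0
        for rule_symbol, child_states, prob in rules:
            # The symbol is observed, so only matching rules contribute.
            if rule_symbol != symbol or len(child_states) != len(children):
                continue
            weight = prob
            for child_table, child_state in zip(child_tables, child_states):
                weight *= child_table.get(child_state, 0.0)
            total += weight  # sum over alternative runs for this state
        table[state] = total
    return table


def tree_prior(tree):
    """p(t): marginalize the root state against the initial distribution."""
    root_table = inside(tree)
    return sum(INIT[q] * root_table[q] for q in INIT)


if __name__ == "__main__":
    t = ("+", [("x", []), ("x", [])])
    print(tree_prior(t))
```

Each call to `inside` plays the role of a message passed upward in the factor graph: the table for a node already sums over all state assignments of its descendants, so the cost grows with the number of nodes rather than with the number of complete runs.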

Theorems & Definitions (7)

  • Definition 2.1: Tree
  • Example 1
  • Definition 4.1: Symbolic Expression
  • Example 2
  • Example 3: Factor Graph
  • Example 4: Extended pRTE example
  • Example 5: Symbolic Expression Definition