Probabilistic Regular Tree Priors for Scientific Symbolic Reasoning
Tim Schneider, Amin Totounferoush, Wolfgang Nowak, Steffen Staab
TL;DR
This work introduces priors based on probabilistic Regular Tree Expressions (pRTE) to encode scientific knowledge for symbolic regression via formal tree languages, addressing the mismatch between syntactic grammars and tree-structured expressions. By combining a tree-structured prior with discrete and continuous parameter priors and a likelihood from noisy data, the authors formulate a Bayesian inference framework that operates with finite-state machines and factor-graph density evaluations. They demonstrate improved generalization on soil sorption isotherms and synthetic hyper-elastic material models, with a better balance between model complexity and predictive accuracy than standard baselines. The approach offers closure under Boolean operations, scalable prior composition, and uncertainty-aware inference, giving a principled way to incorporate domain knowledge into symbolic regression for scientific applications.
Abstract
Symbolic Regression (SR) allows for the discovery of scientific equations from data. To limit the large search space of possible equations, prior knowledge has been expressed in terms of formal grammars that characterize subsets of arbitrary strings. However, there is a mismatch between the context-free grammars required to express the set of syntactically correct equations, which lack closure properties, and the tree structure of the equations themselves. Our contributions are to (i) compactly express experts' prior beliefs about which equations are more likely, using probabilistic Regular Tree Expressions (pRTE), and (ii) adapt Bayesian inference so that such priors, encoded as finite-state machines, can be used efficiently for symbolic regression. Our scientific case studies show the effectiveness of the approach in soil science, for finding sorption isotherms, and in modeling hyper-elastic materials.
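As a rough illustration of what a probabilistic prior over expression trees can look like, the following minimal sketch scores trees with a toy weighted top-down tree automaton. It is not the authors' pRTE construction: the states (`expr`, `atom`), the symbols, the rule probabilities, and the function name `prior` are all hypothetical and chosen only for this example.

```python
from math import isclose

# Hypothetical toy rules: (state, symbol) -> (probability, child states).
# Probabilities of all rules sharing a state sum to 1.
RULES = {
    ("expr", "add"): (0.3, ("atom", "expr")),
    ("expr", "mul"): (0.2, ("atom", "atom")),
    ("expr", "x"): (0.3, ()),
    ("expr", "c"): (0.2, ()),
    ("atom", "x"): (0.6, ()),
    ("atom", "c"): (0.4, ()),
}


def prior(tree, state="expr"):
    """Return the prior probability of `tree` when derived from `state`.

    A tree is a tuple: (symbol, child_1, ..., child_n).
    Trees that no rule sequence can produce get probability 0.
    """
    symbol, *children = tree
    rule = RULES.get((state, symbol))
    if rule is None or len(rule[1]) != len(children):
        return 0.0
    prob, child_states = rule
    # Multiply the rule probability by the probability of each subtree.
    for child, child_state in zip(children, child_states):
        prob *= prior(child, child_state)
    return prob


# x + (x * c): allowed, because the left operand of `add` is an atom.
allowed = ("add", ("x",), ("mul", ("x",), ("c",)))
# (x + x) + x: excluded, because the left operand of `add` is itself a sum.
excluded = ("add", ("add", ("x",), ("x",)), ("x",))
```

Here the toy rules encode a structural belief (sums nest only to the right), so the prior rules out some syntactically valid equations and ranks the remaining ones by probability.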
