Parsing the Language of Expression: Enhancing Symbolic Regression with Domain-Aware Symbolic Priors
Sikai Huang, Yixin Berry Wen, Tara Adusumilli, Kusum Choudhary, Haizhao Yang
TL;DR
This work tackles the challenge of symbolic regression by injecting domain-specific priors into the expression search to improve interpretability and learning efficiency. It introduces a multi-branch expression representation and a tree-structured RL agent, augmented with hard and soft priors encoded as conditional distributions and enforced through KL-divergence regularization and hard constraints. Priors are systematically extracted from large corpora across physics, biology, chemistry, and engineering, with normalization to mitigate variable-count biases, and are analyzed along vertical and horizontal dimensions. Experiments on SRBench benchmarks and cross-domain cases show that priors plus hierarchical search yield faster convergence and higher recovery rates. The approach can introduce bias in some domains; dynamic KL scheduling mitigates this, and learning priors jointly with the agent is left to future work.
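The interplay of hard constraints, a KL penalty, and a changing KL weight can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the direction KL(policy || prior), the additive loss form, and the linear decay of the weight are all hypothetical choices, since the summary only states that priors are enforced through KL-divergence regularization and hard constraints with dynamic KL scheduling.

```python
import math

def apply_hard_mask(logits, allowed):
    # Hard constraint (assumed form): forbidden symbols get a -inf logit,
    # so their probability after softmax is exactly zero.
    return [x if ok else float("-inf") for x, ok in zip(logits, allowed)]

def softmax(logits):
    m = max(logits)
    if m == float("-inf"):
        raise ValueError("at least one symbol must be allowed")
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), summed over entries with p_i > 0; q is clamped to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def kl_weight(step, total_steps, start=1.0, end=0.1):
    # Hypothetical schedule: linear interpolation from `start` to `end`.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def regularized_loss(task_loss, policy_logits, prior, allowed, weight):
    # Soft prior (assumed form): penalize divergence of the masked policy
    # from the prior distribution over the same symbols.
    policy = softmax(apply_hard_mask(policy_logits, allowed))
    return task_loss + weight * kl_divergence(policy, prior)

# Toy symbol set; "sin" is forbidden at this position by a hard constraint.
symbols = ["+", "*", "sin", "x"]
allowed = [True, True, False, True]
uniform_prior = [0.25, 0.25, 0.25, 0.25]
policy = softmax(apply_hard_mask([0.0, 0.0, 0.0, 0.0], allowed))
loss = regularized_loss(1.0, [0.0, 0.0, 0.0, 0.0], uniform_prior, allowed, kl_weight(0, 100))
```

In this sketch the mask removes a symbol outright, while the KL term only makes departures from the prior costly. Lowering the weight over training lets the agent move away from the prior when the data call for it.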
Abstract
Symbolic regression is essential for deriving interpretable expressions that elucidate complex phenomena by exposing the underlying mathematical and physical relationships in data. In this paper, we present an advanced symbolic regression method that integrates symbol priors from diverse scientific domains, including physics, biology, chemistry, and engineering, into the regression process. By systematically analyzing domain-specific expressions, we derive probability distributions of symbols to guide expression generation. We propose novel tree-structured recurrent neural networks (RNNs) that leverage these symbol priors, enabling domain knowledge to steer the learning process. Additionally, we introduce a hierarchical tree structure for representing expressions, where unary and binary operators are organized to facilitate more efficient learning. To further accelerate training, we compile characteristic expression blocks from each domain and include them in the operator dictionary, providing relevant building blocks. Experimental results demonstrate that leveraging symbol priors significantly enhances the performance of symbolic regression, resulting in faster convergence and higher accuracy.
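As a rough illustration of deriving symbol distributions from a set of expressions, the sketch below counts how often each symbol appears as a child of each parent symbol and normalizes the counts into conditional probabilities. The toy corpus, the nested-tuple encoding, the `ROOT` pseudo-parent, and the function names are assumptions made for this example. It does not attempt the paper's normalization for variable-count bias, nor its specific tree representation.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each expression is a nested tuple
# (operator, child, ...); leaves are plain strings.
CORPUS = [
    ("*", "m", "a"),                # m * a
    ("*", "m", ("*", "c", "c")),    # m * (c * c)
    ("sin", ("*", "w", "t")),       # sin(w * t)
]

def symbol_of(node):
    # The symbol at a node: the operator for a subtree, the token for a leaf.
    return node[0] if isinstance(node, tuple) else node

def count_pairs(node, counts, parent="ROOT"):
    # Record one (parent, child) occurrence, then recurse into the children.
    sym = symbol_of(node)
    counts[parent][sym] += 1
    if isinstance(node, tuple):
        for child in node[1:]:
            count_pairs(child, counts, parent=sym)

def conditional_prior(corpus):
    # Estimate P(child symbol | parent symbol) by relative frequency.
    counts = defaultdict(Counter)
    for expr in corpus:
        count_pairs(expr, counts)
    prior = {}
    for parent, children in counts.items():
        total = sum(children.values())
        prior[parent] = {sym: n / total for sym, n in children.items()}
    return prior

prior = conditional_prior(CORPUS)
```

Here `prior["ROOT"]` gives the distribution over the top-level symbol, and `prior["*"]` gives the distribution over the arguments of a multiplication. A table of this kind is one way a domain-specific distribution could be handed to an expression generator.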
