Parsing the Language of Expression: Enhancing Symbolic Regression with Domain-Aware Symbolic Priors

Sikai Huang, Yixin Berry Wen, Tara Adusumilli, Kusum Choudhary, Haizhao Yang

TL;DR

This work tackles the challenge of symbolic regression by injecting domain-specific priors into the expression search to improve interpretability and learning efficiency. It introduces a multi-branch expression representation and a tree-structured RL agent, augmented with hard and soft priors encoded as conditional distributions and enforced through KL-divergence regularization and hard constraints. Priors are systematically extracted from large corpora across physics, biology, chemistry, and engineering, with normalization to mitigate variable-count biases, and are analyzed along vertical and horizontal dimensions. Experiments on SRBench benchmarks and cross-domain cases show that priors combined with hierarchical search yield faster convergence and higher recovery rates. The priors can introduce bias in some domains; dynamic KL scheduling mitigates this, and learning the priors jointly with the agent is left to future work.

Abstract

Symbolic regression is essential for deriving interpretable expressions that elucidate complex phenomena by exposing the underlying mathematical and physical relationships in data. In this paper, we present an advanced symbolic regression method that integrates symbol priors from diverse scientific domains - including physics, biology, chemistry, and engineering - into the regression process. By systematically analyzing domain-specific expressions, we derive probability distributions of symbols to guide expression generation. We propose novel tree-structured recurrent neural networks (RNNs) that leverage these symbol priors, enabling domain knowledge to steer the learning process. Additionally, we introduce a hierarchical tree structure for representing expressions, where unary and binary operators are organized to facilitate more efficient learning. To further accelerate training, we compile characteristic expression blocks from each domain and include them in the operator dictionary, providing relevant building blocks. Experimental results demonstrate that leveraging symbol priors significantly enhances the performance of symbolic regression, resulting in faster convergence and higher accuracy.
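To make the idea of deriving symbol distributions from domain expressions concrete, the following is a minimal sketch, not the authors' implementation. It assumes expressions are stored as prefix-notation token lists and estimates a conditional distribution of a child symbol given its parent symbol by simple counting. The symbol set, the `ARITY` table, the `<root>` marker, and the toy corpus are all hypothetical, and the sketch omits the normalization step mentioned in the TL;DR.

```python
from collections import Counter, defaultdict

# Hypothetical symbol set with arities (an assumption for illustration only).
ARITY = {"add": 2, "mul": 2, "sin": 1, "exp": 1, "x": 0, "c": 0}


def parent_child_pairs(prefix_tokens):
    """Yield (parent, child) pairs from one expression in prefix notation."""
    stack = []  # each entry is [symbol, number of children still expected]
    for tok in prefix_tokens:
        parent = stack[-1][0] if stack else "<root>"
        yield parent, tok
        if stack:
            stack[-1][1] -= 1
        if ARITY[tok] > 0:
            stack.append([tok, ARITY[tok]])
        # Discard operators whose children have all been read.
        while stack and stack[-1][1] == 0:
            stack.pop()


def estimate_prior(corpus):
    """Return P(child | parent) as nested dicts, estimated by counting."""
    counts = defaultdict(Counter)
    for expr in corpus:
        for parent, child in parent_child_pairs(expr):
            counts[parent][child] += 1
    return {
        parent: {sym: n / sum(ctr.values()) for sym, n in ctr.items()}
        for parent, ctr in counts.items()
    }


# Toy corpus: c*sin(x), x + exp(x), x*x (hypothetical examples).
corpus = [
    ["mul", "c", "sin", "x"],
    ["add", "x", "exp", "x"],
    ["mul", "x", "x"],
]
prior = estimate_prior(corpus)
print(prior["mul"])
```

A table like `prior` could then be looked up by a generator each time it chooses the next symbol under a given parent; how the paper conditions its priors in detail is not specified in this summary.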

Paper Structure

This paper contains 20 sections, 21 equations, 10 figures, 1 algorithm.

Figures (10)

  • Figure 1: A tree-structured RNN-based reinforcement learning framework for generating symbolic expressions. Domain-specific priors (top) are incorporated as soft constraints (KL divergence) and hard constraints (rule-based masking), guiding the controller to propose expressive yet valid “skeletons.” Sampled expressions are then refined by optimizing their parameters, yielding interpretable mathematical models aligned with target data.
  • Figure 2: Parameter optimization and candidate selection loop. The tree-structured RNN controller generates candidate expressions, which are evaluated against the data to produce a reward signal. High-scoring candidates are selected for further parameter tuning, where the weights $\theta=\{\alpha, \beta, \gamma\}$ are refined to better fit the target data, ultimately yielding more accurate symbolic expressions.
  • Figure 3: The top panel illustrates the fundamental structure of our representation method, while the bottom panel presents two example expressions represented using this structure.
  • Figure 4: Examples of two expression trees, illustrating their subsequences, width, and depth.
  • Figure 5: Statistical distributions of expression depth, width, and root nodes across Physics, Biology, Chemistry, and Engineering.
  • ...and 5 more figures
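
The caption of Figure 1 describes two ways of applying priors: a soft constraint (KL divergence) and a hard constraint (rule-based masking). The sketch below illustrates how the two might combine at a single sampling step, under stated assumptions. The function names, the KL direction (policy relative to prior), and the fixed `kl_weight` are hypothetical choices for illustration; the paper's actual objective and its dynamic KL schedule are not given in this summary.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over logits, with disallowed symbols forced to probability 0.

    Assumes at least one entry of `allowed` is True.
    """
    top = max(l for l, ok in zip(logits, allowed) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def regularized_reward(reward, policy, prior, kl_weight):
    """Reward minus a KL penalty pulling the policy toward the prior."""
    return reward - kl_weight * kl_divergence(policy, prior)


# Hypothetical step: three candidate symbols, the third ruled out by a hard rule.
logits = [0.0, 0.0, 0.0]
allowed = [True, True, False]
prior = [0.25, 0.75, 0.0]

policy = masked_softmax(logits, allowed)
print(policy, regularized_reward(1.0, policy, prior, kl_weight=0.1))
```

Masking removes invalid symbols outright, while the KL term only discourages departures from the prior, so a sufficiently large reward can still override it. Varying `kl_weight` over training would be one simple way to realize a schedule.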