Beyond Accuracy and Complexity: The Effective Information Criterion for Structurally Stable Symbolic Regression

Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li

TL;DR

The paper addresses the inadequacy of traditional accuracy- and complexity-based objectives in symbolic regression by introducing the Effective Information Criterion (EIC), which quantifies structural stability as the amplification of rounding noise through a formula's computation graph. EIC is defined via a recursive relation on a formula's symbol tree and a worst-case node metric: $\text{EIC} = \max_{k \in \mathcal{T}[f]} \log_{10} \bar{s}_k$, with $\bar{s}_k^2 = \mathbb{E}_{x}[s_k^2(x)]$ and $s_k^2(x) = 1 + \sum_{i \in \mathcal{C}[k]} \kappa_{k,i}^2(x) s_i^2(x)$, where $\kappa_{k,i}$ is a relative condition number. The authors show a structural stability gap between human-derived physical laws and SR-discovered formulas, and demonstrate EIC's practical utility: improving Pareto fronts in heuristic search, increasing pretraining sample efficiency for generative SR by filtering high-EIC samples, and aligning with human interpretability preferences in a majority of cases. These results suggest that enforcing structural stability via EIC can enhance interpretability and reliability in AI-driven scientific discovery. The work also provides physical and signal-processing interpretations, and validates EIC through extensive human expert and LLM corroboration.
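
The recursion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the nested-tuple tree encoding, the `OPS` table, and the helper names `node_stats` and `eic` are hypothetical; constants are treated as leaves with $s_k^2 = 1$; the expectation over $x$ is replaced by an empirical mean over the given samples; and the relative condition number is taken as $\kappa_{k,i} = \frac{y_i}{y_k}\frac{\partial e_k}{\partial y_i}$, as in Proposition 3.2. Nodes whose value is exactly zero are not handled.

```python
import math

# Hypothetical operator table: each entry returns
# (output value, partial derivatives with respect to each operand).
OPS = {
    "add": lambda a, b: (a + b, (1.0, 1.0)),
    "sub": lambda a, b: (a - b, (1.0, -1.0)),
    "mul": lambda a, b: (a * b, (b, a)),
    "div": lambda a, b: (a / b, (1.0 / b, -a / (b * b))),
    "exp": lambda a: (math.exp(a), (math.exp(a),)),
    "sin": lambda a: (math.sin(a), (math.cos(a),)),
}


def node_stats(node, x, acc, path=()):
    """Return (y_k, s_k^2) for one sample x.

    Records s_k^2 of every node in `acc`, keyed by the node's position
    in the tree, so that it can be averaged over samples afterwards.
    """
    if isinstance(node, tuple):
        op, *children = node
        stats = [node_stats(c, x, acc, path + (i,))
                 for i, c in enumerate(children)]
        y, grads = OPS[op](*(y_i for y_i, _ in stats))
        # s_k^2 = 1 + sum_i kappa_{k,i}^2 * s_i^2
        s2 = 1.0
        for (y_i, s2_i), g in zip(stats, grads):
            kappa = y_i / y * g  # relative condition number
            s2 += kappa ** 2 * s2_i
    else:
        # Leaf: a variable name (str) or a numeric constant (assumption).
        y = float(x[node]) if isinstance(node, str) else float(node)
        s2 = 1.0
    acc.setdefault(path, []).append(s2)
    return y, s2


def eic(tree, samples):
    """Empirical EIC: max over nodes of log10 of the RMS of s_k(x)."""
    acc = {}
    for x in samples:
        node_stats(tree, x, acc)
    worst_mean_s2 = max(sum(v) / len(v) for v in acc.values())
    return 0.5 * math.log10(worst_mean_s2)
```

For instance, `eic(("mul", "x", "y"), samples)` stays near $\tfrac{1}{2}\log_{10} 3$ for any nonzero inputs, whereas a cancellation such as `("sub", ("add", "x", 1.0), "x")` evaluated at large `x` yields a much larger value, which is the kind of structure the criterion is meant to penalize.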

Abstract

Symbolic regression (SR) traditionally balances accuracy and complexity, implicitly assuming that simpler formulas are structurally more rational. We argue that this assumption is insufficient: existing algorithms often exploit this metric to discover accurate and compact but structurally irrational formulas that are numerically ill-conditioned and physically inexplicable. Inspired by the structural stability of real physical laws, we propose the Effective Information Criterion (EIC) to quantify formula rationality. EIC models formulas as information channels and measures the amplification of inherent rounding noise during recursive calculation, effectively distinguishing physically plausible structures from pathological ones without relying on ground truth. Our analysis reveals a stark structural stability gap between human-derived equations and SR-discovered results. By integrating EIC into SR workflows, we provide explicit structural guidance: for heuristic search, EIC steers algorithms toward stable regions to yield superior Pareto frontiers; for generative models, EIC-based filtering improves pre-training sample efficiency by 2-4 times and boosts generalization $R^2$ by 22.4%. Finally, an extensive study with 108 human experts shows that EIC aligns with human preferences in 70% of cases, validating structural stability as a critical prerequisite for human-perceived interpretability.

Paper Structure

This paper contains 26 sections, 2 theorems, 36 equations, 15 figures, 7 tables, and 2 algorithms.

Key Result

Proposition 3.2

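For every node $k$ of the symbol tree $\mathcal{T}[f]$ with children $\mathcal{C}[k]$, the noise amplification factor satisfies $s_k^2(x) = 1 + \sum_{i \in \mathcal{C}[k]} \kappa_{k,i}^2(x)\, s_i^2(x)$, where $\kappa_{k, i} \triangleq \frac{y_i}{y_k} \frac{\partial e_k}{\partial y_i}$ is the partial relative condition number of the operation $e_k$ with respect to its operand $i$. For leaf nodes, where $\mathcal{C}[k] = \emptyset$, the sum is empty and $s_k^2(x) = 1 + 0 = 1$.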
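
As a hedged worked example (the operand values are chosen purely for illustration and do not come from the paper), consider a subtraction node $e_k = y_a - y_b$ whose operands are both leaves, so $s_a^2 = s_b^2 = 1$. Applying the recursion $s_k^2 = 1 + \sum_{i \in \mathcal{C}[k]} \kappa_{k,i}^2 s_i^2$ with $\kappa_{k,i} = \frac{y_i}{y_k}\frac{\partial e_k}{\partial y_i}$ gives:

```latex
\begin{align*}
\kappa_{k,a} &= \frac{y_a}{y_a - y_b}, \qquad
\kappa_{k,b} = -\frac{y_b}{y_a - y_b}, \\
s_k^2 &= 1 + \frac{y_a^2 + y_b^2}{(y_a - y_b)^2}. \\
\intertext{For nearly equal operands, e.g. $y_a = 1.001$ and $y_b = 1$:}
s_k^2 &= 1 + \frac{1.002001 + 1}{10^{-6}} \approx 2.0 \times 10^{6},
\qquad \log_{10} s_k \approx 3.15. \\
\intertext{For a product $e_k = y_a y_b$ of two leaves, $\kappa_{k,a} = \kappa_{k,b} = 1$, so}
s_k^2 &= 1 + 1 + 1 = 3, \qquad \log_{10} s_k \approx 0.24.
\end{align*}
```

The subtraction of nearly equal quantities amplifies relative rounding noise by roughly three orders of magnitude, while the product barely amplifies it at all, even though both nodes have the same complexity.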

Figures (15)

  • Figure 1: Formulas with identical complexity and accuracy can exhibit distinct structural rationality. Despite having the same length and $R^2$, the left formula contains pathological nesting, while the right one remains structurally sound, demonstrating that complexity fails to capture structural rationality.
  • Figure 2: The structural stability gap between real physical formulas and SR results. We compare 133 ground truth formulas (Feynman, Strogatz) with results from 17 SR algorithms, quantified by the noise amplification factor (defined in the method section). The inner box shows the quartiles, and the black dots indicate outliers beyond $1.5\times$IQR.
  • Figure 3: The structural stability gap between real physical formulas and SR-discovered ones. The figure shows the average complexity and EIC of formulas discovered by 17 SR methods on 133 SRBench white-box problems (SBP-GP is not displayed due to its high EIC). Underfit formulas with test set $R^2$ below $0.8$ are excluded to focus on accurate candidates.
  • Figure 4: Pareto fronts on the SR benchmark. $\dagger$, $\ddagger$, $*$, and $**$ indicate generative, regression, decision-tree, and black-box methods, respectively, while the others are search-based methods. The lines show the Pareto front tiers, from bottom-left (best) to top-right (worst).
  • Figure 5: Generalization performance of E2ESR trained on different samples. The grey line shows the convergence trained on random formulas.
  • ...and 10 more figures

Theorems & Definitions (8)

  • Proposition 3.2: Recursive Relation of $s_k(x)$
  • proof
  • Lemma A.1
  • proof
  • proof
  • proof
  • proof
  • proof