Beyond Accuracy and Complexity: The Effective Information Criterion for Structurally Stable Symbolic Regression
Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li
TL;DR
The paper argues that accuracy- and complexity-based objectives alone are inadequate for symbolic regression (SR) and introduces the Effective Information Criterion (EIC), which quantifies a formula's structural stability as the amplification of rounding noise through its computation graph. EIC is defined by a recursion over the formula's symbol tree together with a worst-case node metric: $\text{EIC} = \max_{k \in \mathcal{T}[f]} \log_{10} \bar{s}_k$, with $\bar{s}_k^2 = \mathbb{E}_{x}[s_k^2(x)]$ and $s_k^2(x) = 1 + \sum_{i \in \mathcal{C}[k]} \kappa_{k,i}^2(x) s_i^2(x)$, where $\mathcal{T}[f]$ is the set of nodes in the symbol tree of $f$, $\mathcal{C}[k]$ is the set of children of node $k$, and $\kappa_{k,i}$ is the relative condition number of node $k$ with respect to its child $i$. The authors show a structural stability gap between human-derived physical laws and SR-discovered formulas, and demonstrate EIC's practical utility in three ways: it improves Pareto fronts in heuristic search, it increases pretraining sample efficiency for generative SR when high-EIC samples are filtered out, and it aligns with human interpretability preferences in 70% of cases. These results suggest that enforcing structural stability via EIC can enhance interpretability and reliability in AI-driven scientific discovery. The work also provides physical and signal-processing interpretations of EIC and corroborates it with both human experts and LLMs.
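To make the recursion concrete, the following is a minimal sketch of how the EIC of a small formula might be evaluated, not the authors' implementation. It assumes a hypothetical tuple encoding of the symbol tree, supports only the four arithmetic operators, takes leaves to have $s^2 = 1$ (the recursion with no children), and estimates the expectation over $x$ by a sample mean. The relative condition numbers used are the standard ones: $|a/(a \pm b)|$ and $|b/(a \pm b)|$ for addition and subtraction, and $1$ for multiplication and division.

```python
import numpy as np

def eic(tree, samples):
    """Estimate EIC = max_k log10(s_bar_k) for a tuple-encoded formula.

    tree    : ("var", name) | ("const", value) | (op, left, right),
              with op in {"add", "sub", "mul", "div"} (hypothetical encoding).
    samples : dict mapping variable names to equal-length 1-D arrays.
    """
    n = len(next(iter(samples.values())))
    s_bar_sq = []  # E_x[s_k^2(x)] for every node k in the tree

    def visit(node):
        op = node[0]
        if op == "var":
            val = np.asarray(samples[node[1]], dtype=float)
            s2 = np.ones(n)  # leaf: no children, so s^2 = 1
        elif op == "const":
            val = np.full(n, float(node[1]))
            s2 = np.ones(n)  # assumption: constants carry unit rounding noise
        else:
            a, sa = visit(node[1])
            b, sb = visit(node[2])
            if op in ("add", "sub"):
                val = a + b if op == "add" else a - b
                ka, kb = a / val, b / val  # ill-conditioned when val is near 0
            elif op in ("mul", "div"):
                val = a * b if op == "mul" else a / b
                ka = kb = np.ones(n)
            else:
                raise ValueError(f"unsupported operator: {op}")
            # s_k^2 = 1 + sum_i kappa_{k,i}^2 * s_i^2
            s2 = 1.0 + ka**2 * sa + kb**2 * sb
        s_bar_sq.append(np.mean(s2))
        return val, s2

    visit(tree)
    # log10(s_bar) = 0.5 * log10(s_bar^2); the max over nodes commutes with log10
    return 0.5 * np.log10(max(s_bar_sq))

rng = np.random.default_rng(0)
samples = {"x": rng.uniform(1.0, 2.0, 1000)}

# Two structures that compute the same function f(x) = x.
stable = ("var", "x")
unstable = ("sub", ("add", ("var", "x"), ("const", 1000.0)), ("const", 1000.0))

print("stable  :", eic(stable, samples))
print("unstable:", eic(unstable, samples))
```

Under these assumptions, the bare variable scores 0, while the algebraically equivalent form $(x + 1000) - 1000$ scores roughly 3, because the final subtraction cancels two nearly equal quantities. Accuracy cannot tell the two formulas apart, since both fit the data equally well, and the gap here comes entirely from the conditioning of the final subtraction.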
Abstract
Symbolic regression (SR) traditionally balances accuracy and complexity, implicitly assuming that simpler formulas are structurally more rational. We argue that this assumption is insufficient: existing algorithms often exploit this metric to discover accurate and compact but structurally irrational formulas that are numerically ill-conditioned and physically inexplicable. Inspired by the structural stability of real physical laws, we propose the Effective Information Criterion (EIC) to quantify formula rationality. EIC models formulas as information channels and measures the amplification of inherent rounding noise during recursive calculation, effectively distinguishing physically plausible structures from pathological ones without relying on ground truth. Our analysis reveals a stark structural stability gap between human-derived equations and SR-discovered results. By integrating EIC into SR workflows, we provide explicit structural guidance: for heuristic search, EIC steers algorithms toward stable regions to yield superior Pareto frontiers; for generative models, EIC-based filtering improves pre-training sample efficiency by 2-4 times and boosts generalization $R^2$ by 22.4%. Finally, an extensive study with 108 human experts shows that EIC aligns with human preferences in 70% of cases, validating structural stability as a critical prerequisite for human-perceived interpretability.
