Iterated Agent for Symbolic Regression

Zhuo-Yang Song, Zeyu Cai, Shutao Zhang, Jiashen Wei, Jichen Pan, Shi Qiu, Qing-Hong Cao, Tie-Jiun Hou, Xiaohui Liu, Ming-xing Luo, Hua Xing Zhu

TL;DR

This work tackles symbolic regression by moving search from pure syntax to semantics, using an iterated agent framework where Large Language Models generate semantically informed hypotheses guided by natural-language rationales. IdeaSearchFitter biases the search toward interpretable, physics-aligned expressions and operates within a multi-island evolutionary loop, balancing accuracy, complexity, and interpretability via a Pareto frontier. Empirical results on FSReD demonstrate strong noise robustness and competitive recovery rates, while real-world PMLB datasets reveal interpretable, mechanism-aligned models with favorable NMSE/complexity trade-offs. A frontier PDF case study shows compact, extrapolation-stable parametrizations that align with DGLAP evolution, underscoring the framework’s potential for physics-informed discovery and broader scientific applications.

Abstract

Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter's efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at https://www.ideasearch.cn/.

Paper Structure

This paper contains 46 sections, 9 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Schematic overview of the IdeaSearchFitter framework. The workflow begins with input data comprising variables ($g(m/s^2)$, $r(m)$), targets ($\nu(m/s)$), and uncertainties ($\Delta\nu(m/s)$), alongside a natural-language description of the underlying mechanics (e.g., gravitational effects). This is condensed into a structured data description, which an LLM expands into an enriched context. An ansatz agent then prompts an LLM to generate candidate symbolic expressions, which are evaluated against the data using the reduced $\chi^2/\mathrm{ndf}$ statistic ($\chi^2/\mathrm{ndf} = \sum_i [(y_{\mathrm{pred},i} - y_{\mathrm{true},i})^2/\sigma_i^2] / \mathrm{ndf}$). Evaluated ideas populate a multi-island database, where local mutations and inter-island migrations promote diversity and prevent premature convergence. Promising ansätze are iteratively refined via LLM-guided polishing and global optimization. The search culminates in Pareto frontier selection, trading off fit quality ($\chi^2/\mathrm{ndf}$) against structural complexity (e.g., expression tree nodes), yielding a set of interpretable models; the true relation (e.g., $\nu = \sqrt{2gr}$) is highlighted among candidates.
  • Figure 2: Recovery rate versus noise level on the FSReD. The plot shows the fraction of ground-truth expressions successfully recovered by each method as a function of additive Gaussian noise intensity ($\gamma$). IdeaSearchFitter (blue line with error bars) remains robust to noise, with recovery rates exceeding 70% even at $\gamma=0.1$. The baselines are PySR (orange), Operon (green), and AI-Feynman (red). The purple line represents IdeaSearchFitter evaluated with dimensional analysis enabled. Error bars denote standard error across 120 problems.
  • Figure 3: Minimum validation NMSE achieved by SR methods across eight PMLB datasets. Bars show the lowest validation NMSE and the corresponding complexity for IdeaSearchFitter (blue), PySR (orange), Operon (green), and AI-Feynman (red); absent bars indicate failures (e.g., AI-Feynman on the 192_vinyard dataset).
  • Figure 4: Complexity-NMSE Pareto frontiers for two PMLB datasets (75%-25% train-validation split). Training NMSE (dashed) and validation NMSE (markers) versus tree nodes. IdeaSearchFitter (blue) shows a monotonic decrease without overfitting rebounds, and its expressions map to mechanisms (logarithmic memory in (a); $|t|$-regime separation in (b); cf. Tab. \ref{tab:publication_comparison}). Baselines: PySR (orange, rebound at $\sim10$ nodes in (a)); Operon (green); AI-Feynman (red, failures).
  • Figure 5: Training–validation $\chi^2$ comparison for the anti-up quark ($\bar{u}$) Pareto front. Axes are logarithmic; colour encodes expression complexity (node count). Left: IdeaSearchFitter exhibits a coherent frontier where training and validation $\chi^2$ decrease jointly as complexity increases. Right: PySR displays two regimes. In the low‑training‑error regime (left portion of the panel), increasing complexity reduces training $\chi^2$ while raising validation $\chi^2$. In the higher‑training‑error regime (right portion), training and validation $\chi^2$ decrease together, but complexity changes are erratic, signaling poor numerical stability. The best validation point of IdeaSearchFitter attains a smaller validation $\chi^2$ than PySR, and its associated training $\chi^2$ is at approximately the same level as the elbow of PySR’s frontier.
  • ...and 17 more figures
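
The scoring and selection steps named in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the IdeaSearchFitter implementation: the names `Candidate`, `reduced_chi2`, and `pareto_front` are hypothetical, the data values are invented for illustration, the node counts are hand-assigned, and ndf is assumed to be the number of data points minus the number of fitted parameters (the caption does not define it).

```python
import math
from typing import Callable, List, NamedTuple, Sequence


class Candidate(NamedTuple):
    """Hypothetical record for one scored expression."""
    name: str
    complexity: int   # hand-counted expression-tree nodes (assumption)
    chi2_ndf: float   # reduced chi-squared against the data


def reduced_chi2(y_pred: Sequence[float], y_true: Sequence[float],
                 sigma: Sequence[float], n_params: int = 0) -> float:
    """chi^2/ndf = sum_i ((y_pred_i - y_true_i) / sigma_i)^2 / ndf.

    Assumes ndf = N - n_params; the source caption only writes 'ndf'.
    """
    ndf = len(y_true) - n_params
    if ndf <= 0:
        raise ValueError("need more data points than fitted parameters")
    chi2 = sum(((p - t) / s) ** 2 for p, t, s in zip(y_pred, y_true, sigma))
    return chi2 / ndf


def pareto_front(cands: Sequence[Candidate]) -> List[Candidate]:
    """Keep candidates not dominated in (complexity, chi2_ndf), both minimized."""
    front: List[Candidate] = []
    best = math.inf
    # Walk from simplest to most complex; a candidate survives only if it
    # fits strictly better than every simpler (or equally simple) one.
    for c in sorted(cands, key=lambda c: (c.complexity, c.chi2_ndf)):
        if c.chi2_ndf < best:
            front.append(c)
            best = c.chi2_ndf
    return front


# Toy data generated from nu = sqrt(2 g r); values are illustrative only.
g = [9.8, 9.8, 1.6, 3.7]
r = [1.0, 2.0, 1.0, 5.0]
nu = [math.sqrt(2 * gi * ri) for gi, ri in zip(g, r)]
sigma = [0.1] * len(nu)

# (expression name, callable, hand-counted node count); no free parameters.
models: List[tuple] = [
    ("g*r", lambda gi, ri: gi * ri, 3),
    ("g+r", lambda gi, ri: gi + ri, 3),
    ("sqrt(g*r)", lambda gi, ri: math.sqrt(gi * ri), 4),
    ("sqrt(2*g*r)", lambda gi, ri: math.sqrt(2 * gi * ri), 6),
]

scored = [
    Candidate(name, nodes,
              reduced_chi2([f(gi, ri) for gi, ri in zip(g, r)], nu, sigma))
    for name, f, nodes in models
]
front = pareto_front(scored)
for c in front:
    print(f"{c.name:12s} nodes={c.complexity} chi2/ndf={c.chi2_ndf:.3g}")
```

In this toy run `g*r` is dropped because `g+r` fits better at the same node count, while the remaining three expressions each buy a lower $\chi^2/\mathrm{ndf}$ with additional nodes. A real search would also fit free parameters before scoring, which is why `n_params` enters the ndf.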

Theorems & Definitions (5)

  • Definition 1: Discovery as Search in a Functional Space
  • Definition 2: The Agent Operator
  • Definition 3: Capability and the Guiding Measure
  • Definition 4: The Iterated Agent
  • Definition 5: Search Trajectory