LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, Chandan K Reddy

TL;DR

This work tackles the challenge of data-driven scientific equation discovery by integrating domain knowledge through Large Language Models with executable-program representations of equations. LLM-SR generates equation skeletons as programs, optimizes their parameters with data-driven evaluators, and uses a multi-island experience buffer to guide iterative refinement. Across benchmarks spanning physics, biology, and materials science, LLM-SR achieves superior accuracy and markedly better out-of-domain generalization than state-of-the-art symbolic regression baselines, while requiring far fewer iterations. Ablation studies confirm the essential roles of problem priors, program-based representations, and iterative refinement in achieving robust discovery, suggesting a promising path for LLM-guided scientific reasoning and model discovery.

Abstract

Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data presents significant challenges due to the necessity of navigating extremely large combinatorial hypothesis spaces. Current methods of equation discovery, commonly known as symbolic regression techniques, largely focus on extracting equations from data alone, often neglecting the domain-specific prior knowledge that scientists typically depend on. They also employ limited representations such as expression trees, constraining the search space and expressiveness of equations. To bridge this gap, we introduce LLM-SR, a novel approach that leverages the extensive scientific knowledge and robust code generation capabilities of Large Language Models (LLMs) to discover scientific equations from data. Specifically, LLM-SR treats equations as programs with mathematical operators and combines LLMs' scientific priors with evolutionary search over equation programs. The LLM iteratively proposes new equation skeleton hypotheses, drawing from its domain knowledge, which are then optimized against data to estimate parameters. We evaluate LLM-SR on four benchmark problems across diverse scientific domains (e.g., physics, biology), which we carefully designed to simulate the discovery process and prevent LLM recitation. Our results demonstrate that LLM-SR discovers physically accurate equations that significantly outperform state-of-the-art symbolic regression baselines, particularly in out-of-domain test settings. We also show that LLM-SR's incorporation of scientific priors enables more efficient equation space exploration than the baselines. Code and data are available: https://github.com/deep-symbolic-mathematics/LLM-SR
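As a rough illustration of the idea that an equation is a program skeleton whose free parameters are then fitted to data, the sketch below hand-writes one candidate skeleton for a nonlinear oscillator and estimates its parameters. Everything in it is an assumption made for illustration: the function names, the synthetic noise-free data, the assumed ground-truth coefficients, and the use of negative mean squared error as the score. A plain finite-difference gradient descent stands in for the optimizer only to keep the example dependency-free; it is not presented as the optimizer LLM-SR uses. In the actual framework the skeleton would be proposed by the LLM rather than written by hand.

```python
# Hypothetical illustration: names, data, and optimizer are assumptions,
# not taken from the LLM-SR code base.

def equation_skeleton(x, v, params):
    """Candidate skeleton for oscillator acceleration with free parameters."""
    return params[0] * x + params[1] * v + params[2] * x ** 3


def mean_squared_error(skeleton, params, data):
    """Average squared residual of the skeleton over (x, v, a) samples."""
    return sum((skeleton(x, v, params) - a) ** 2 for x, v, a in data) / len(data)


def fit_parameters(skeleton, data, n_params, lr=0.5, steps=1000, eps=1e-6):
    """Fit params by gradient descent with central finite differences."""
    params = [1.0] * n_params
    for _ in range(steps):
        grad = []
        for i in range(n_params):
            up, down = params[:], params[:]
            up[i] += eps
            down[i] -= eps
            grad.append((mean_squared_error(skeleton, up, data)
                         - mean_squared_error(skeleton, down, data)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


# Synthetic, noise-free data from an assumed ground truth:
# a = -1.0 * x - 0.3 * v - 0.5 * x**3
TRUE_PARAMS = [-1.0, -0.3, -0.5]
grid_x = [-1.0 + 0.1 * i for i in range(21)]
grid_v = [-1.0 + 0.2 * j for j in range(11)]
data = [(x, v, equation_skeleton(x, v, TRUE_PARAMS)) for x in grid_x for v in grid_v]

fitted = fit_parameters(equation_skeleton, data, n_params=3)
score = -mean_squared_error(equation_skeleton, fitted, data)  # higher is better
print([round(p, 3) for p in fitted], score)
```

Separating the skeleton (structure) from `params` (numeric values) means the search over equation forms never has to guess constants; each proposed structure is judged by its best achievable fit.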

Paper Structure (53 sections, 5 equations, 27 figures, 5 tables, 1 algorithm)

Figures (27)

  • Figure 1: The LLM-SR framework, consisting of three main steps: (a) Hypothesis Generation, where the LLM generates equation program skeletons based on a structured prompt; (b) Data-driven Evaluation, which optimizes the parameters of each equation skeleton hypothesis and assesses its fit to the data; and (c) Experience Management, which maintains a diverse buffer of high-scoring hypotheses that supply informative in-context examples for the LLM's prompt, enabling effective iterative refinement.
  • Figure 2: Example of initial input prompt for the nonlinear oscillator discovery task, including problem specification, evaluation and optimization function, and the initial input equation example.
  • Figure 3: Best score trajectories of LLM-SR with GPT-3.5 and Mixtral against SR baselines across different benchmark problems. LLM-SR discovers accurate equations more efficiently, requiring fewer iterations. Baselines fail to match LLM-SR even after $2$M iterations.
  • Figure 4: Discovered equations for Oscillation 1 (top) and Oscillation 2 (bottom) problems: (a) True equations and their phase diagram; (b) Equation program skeletons identified by LLM-SR, with simplified forms obtained after parameter optimization; and (c) Equations found using SR baselines. Shaded green terms denote recovered symbolic terms from true equations.
  • Figure 5: Comparison of E. coli growth rate distributions from LLM-SR, PySR, and uDSR.
  • ...and 22 more figures
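
The experience-management step described in the Figure 1 caption can be sketched as a small multi-island buffer. This is a simplified, hypothetical version: the class name, method names, capacity rule, and the top-k selection are assumptions for illustration and are not taken from the LLM-SR implementation, which may sample examples differently. Each island keeps only its highest-scoring hypotheses, and in-context examples are drawn from one randomly chosen island so that islands can evolve different equation families.

```python
import random


class ExperienceBuffer:
    """Hypothetical multi-island store of (score, program_text) hypotheses."""

    def __init__(self, num_islands=4, capacity=10, seed=0):
        self.islands = [[] for _ in range(num_islands)]
        self.capacity = capacity          # max hypotheses kept per island
        self.rng = random.Random(seed)    # seeded for reproducible sampling

    def register(self, island_id, score, program):
        """Add a scored hypothesis and keep only the island's best entries."""
        island = self.islands[island_id]
        island.append((score, program))
        island.sort(key=lambda entry: entry[0], reverse=True)
        del island[self.capacity:]

    def sample_examples(self, k=2):
        """Pick a random non-empty island and return its top-k programs,
        ordered from lower to higher score so the prompt ends on the best."""
        non_empty = [i for i, island in enumerate(self.islands) if island]
        if not non_empty:
            raise ValueError("buffer is empty")
        island_id = self.rng.choice(non_empty)
        top = self.islands[island_id][:k]
        return island_id, [program for _, program in reversed(top)]

    def best(self):
        """Return the highest-scoring (score, program) across all islands."""
        return max((entry for island in self.islands for entry in island),
                   key=lambda entry: entry[0])


# Usage with placeholder program texts and made-up scores (negative error).
buffer = ExperienceBuffer(num_islands=2, capacity=2, seed=0)
buffer.register(0, -0.5, "skeleton A")
buffer.register(0, -0.1, "skeleton B")
buffer.register(0, -0.9, "skeleton C")   # dropped: island 0 is at capacity
buffer.register(1, -0.3, "skeleton D")
island_id, examples = buffer.sample_examples(k=2)
```

The sampled `examples` would then be placed in the next prompt as in-context demonstrations, which is how earlier high-scoring hypotheses steer later proposals.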