Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu

TL;DR

This work addresses the discovery of interpretable symbolic equations from data by introducing KeplerAgent, a physics-guided LLM agent that mimics the stepwise reasoning of scientists. The agent orchestrates physics-based tools to extract intermediate structure (e.g., symmetries) and uses it as a prior to configure symbolic regression (SR) backends such as PySINDy and PySR, substantially narrowing the search space. Across algebraic, ODE, and PDE benchmarks, KeplerAgent achieves higher symbolic accuracy and markedly better robustness to noise than both pure LLM-based SR methods and standard SR baselines, while maintaining reasonable runtime. The results suggest that integrating structure discovery with SR can make model discovery in dynamical systems and beyond faster and more transparent. However, the approach relies on an expanding toolbox and careful management of computational costs, which points to future work on scaling and more explicit reasoning frameworks.

Abstract

Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
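
The two-step idea in the abstract (infer structure first, then use it to restrict the candidate equations) can be illustrated with a small sketch. This is not KeplerAgent's implementation and does not call PySINDy or PySR: it uses a plain NumPy version of sequentially thresholded least squares, the kind of sparse regression used in SINDy-style methods, and it stubs out the structure-discovery step. The target function, the `prior` dictionary, and the `stlsq` helper are all assumptions made for illustration.

```python
import numpy as np

def stlsq(theta, y, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    coef = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        keep = ~small
        if not keep.any():
            break
        # Refit using only the surviving library terms.
        coef[keep] = np.linalg.lstsq(theta[:, keep], y, rcond=None)[0]
    return coef

# Synthetic data from a hypothetical odd target: y = 1.5*x - 0.5*x**3, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=400)
y = 1.5 * x - 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)

# Step 1 (structure discovery, stubbed): assume a tool reported f(-x) = -f(x).
prior = {"odd_symmetry": True}

# Step 2: use the prior to configure the candidate library.
# Without the prior the library would be x^0..x^5; with it, only odd powers remain.
powers = [p for p in range(0, 6) if not prior["odd_symmetry"] or p % 2 == 1]
theta = np.column_stack([x**p for p in powers])

coef = stlsq(theta, y)
discovered = {f"x^{p}": round(float(c), 2) for p, c in zip(powers, coef) if c != 0.0}
print(discovered)
```

Here the symmetry prior halves the library before any fitting happens, which is the sense in which intermediate structure narrows the search space. In a real pipeline the same prior would instead be translated into the function library or constraints of the chosen SR backend.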

Paper Structure (42 sections, 1 equation, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Our KeplerAgent orchestrates physics-based tools and is capable of discovering different types of equations from data.
  • Figure 2: The overall design of our KeplerAgent framework. The input to the LLM contains a system prompt explaining the task setup, a list of tool specifications, the user query containing information about the dataset, as well as a workspace summary and an experience log recording previous steps. The agent reasons about the existing findings and decides subsequent tool calls iteratively until obtaining a satisfactory equation discovery result.
  • Figure 3: Distribution of NMSEs on LSR-Transform equations for PySR, LLM-SR, and KeplerAgent with one/three runs. The horizontal red dashed line marks the average NMSE of LLM-SR ($0.0091$) reported by Shojaee et al. (2025).
  • Figure 4: Distribution of NMSEs on DiffEq datasets for PySR, LLM-SR, and KeplerAgent (with a single run). Median values are shown next to the median lines.
  • Figure 5: Normalized MSE plotted against time for equations discovered on clean data by PySR, LLM-SR, and our KeplerAgent. A method is omitted from a subplot if its discovered equation causes simulation failure (e.g., state variables going to infinity, or other numerical issues).
  • ...and 1 more figures
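
The captions above report normalized mean squared error (NMSE) without defining it on this page. A minimal sketch follows, assuming the common convention of dividing the mean squared error by the variance of the targets; the paper's exact normalization may differ, and the function name `nmse` is illustrative.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error divided by the variance of the targets (assumed convention)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Undefined for constant targets, where the variance is zero.
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nmse(y, y)                            # exact predictions
baseline = nmse(y, np.full_like(y, y.mean()))   # always predicting the mean
print(perfect, baseline)
```

Under this convention a perfect fit scores 0 and a constant prediction at the target mean scores 1, so values well below 1 indicate that the discovered equation explains most of the variation in the data.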