LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy

TL;DR

LLM-SRBench presents a 239-problem benchmark for evaluating LLM-based scientific equation discovery, designed to mitigate memorization by introducing two problem families: LSR-Transform (transformed, less-common representations) and LSR-Synth (synthetic, discovery-driven tasks) across chemistry, biology, physics, and material science. The framework combines data-driven reasoning with embedded scientific priors and uses two kinds of evaluation metrics, data fidelity ($\mathrm{Acc}_{\tau}$, NMSE) and symbolic accuracy via an automated evaluator, to assess true discovery capabilities. Across multiple LLM backbones and three discovery baselines, the best symbolic accuracy reaches only 31.5%, underscoring substantial challenges and the need for robust evaluation and improved reasoning in scientific equation discovery. By standardizing data generation, evaluation, and benchmarking, LLM-SRBench aims to drive progress toward truly data-guided, scientifically plausible equation discovery with LLMs.
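The two data-fidelity metrics named above can be sketched in a few lines of Python. This summary does not spell out their formulas, so the definitions below are assumptions: NMSE is taken as the squared error normalized by the variance of the targets, and $\mathrm{Acc}_{\tau}$ as a 0/1 indicator of whether the worst-case relative error stays within the tolerance $\tau$. The function names `nmse` and `acc_tau` are hypothetical.

```python
def nmse(y_true, y_pred):
    """Normalized mean squared error (assumed definition):
    sum of squared errors divided by the sum of squared deviations
    of the targets from their mean."""
    mean_y = sum(y_true) / len(y_true)
    num = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((t - mean_y) ** 2 for t in y_true)
    return num / den


def acc_tau(y_true, y_pred, tau=0.1):
    """Accuracy-to-tolerance (assumed definition): 1.0 if the largest
    relative error over all points is at most tau, else 0.0.
    Assumes no target value is exactly zero."""
    worst = max(abs((p - t) / t) for t, p in zip(y_true, y_pred))
    return 1.0 if worst <= tau else 0.0


# Toy targets and two candidate predictions (illustrative values only).
y = [1.0, 2.0, 3.0]
close_fit = [1.05, 2.0, 3.0]   # worst relative error is 5%
constant_fit = [2.0, 2.0, 2.0]  # predicts the mean everywhere

print(nmse(y, close_fit), acc_tau(y, close_fit))
print(nmse(y, constant_fit), acc_tau(y, constant_fit))
```

Under these assumed definitions, a predictor that always outputs the target mean has an NMSE of exactly 1, which makes values well below 1 easy to read as a real improvement over a trivial baseline.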

Abstract

Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
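To make the LSR-Transform idea concrete, the sketch below re-expresses a familiar relation with a different variable as the target. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the kinetic-energy relation $E = \tfrac{1}{2} m v^2$ is chosen only as a textbook example (it is not claimed to be one of the benchmark problems), and the helpers `kinetic_energy`, `speed_from_energy`, and `make_transformed_dataset` are hypothetical names.

```python
import math
import random


def kinetic_energy(m, v):
    """Familiar form: E as a function of mass m and speed v."""
    return 0.5 * m * v ** 2


def speed_from_energy(E, m):
    """Less common form of the same law: v as a function of E and m
    (valid for positive m and non-negative E and v)."""
    return math.sqrt(2.0 * E / m)


def make_transformed_dataset(n_samples=5, seed=0):
    """Sample (m, v), compute E with the familiar form, then present
    (E, m) as inputs and v as the target to be discovered."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        m = rng.uniform(1.0, 5.0)
        v = rng.uniform(0.5, 10.0)
        rows.append({"E": kinetic_energy(m, v), "m": m, "v": v})
    return rows


dataset = make_transformed_dataset()
for row in dataset:
    # The transformed equation should reproduce the held-out target.
    assert math.isclose(speed_from_energy(row["E"], row["m"]), row["v"])
```

The data describe the same physics in both forms, but a method that only recalls the familiar expression for $E$ no longer matches the requested target, which is the kind of memorization shortcut the transformed problems are meant to expose.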

Paper Structure

This paper contains 31 sections, 3 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Error analysis comparing simple LLM sampling (Llama-3.1-8B) on 100 Feynman problems versus LLM-SRBench datasets (LSR-Transform and LSR-Synth). The sharp drops in numeric error curves and considerably lower symbolic error for Feynman problems suggest memorization rather than gradual discovery.
  • Figure 2: Overview of LLM-based scientific equation discovery. The benchmark tasks (left) combine scientific context with numerical data. The discovery process (middle) iteratively leverages the LLM's scientific knowledge and data-driven reasoning to generate hypotheses for the underlying equations. Discovered hypotheses, represented as equation strings, trees, or programs, are then evaluated (right) using multiple metrics, including data fidelity, symbolic accuracy, and computational efficiency.
  • Figure 3: Data generation pipelines for the two dataset categories in LLM-SRBench. (a) LSR-Transform converts Feynman problems into alternative mathematical forms through symbolic transformation and input-output role switching, and (b) LSR-Synth generates novel discovery-driven problems by combining known scientific terms in the underlying models with synthetic novel terms. Both pipelines include validation steps to ensure solvability and scientific plausibility.
  • Figure 4: Performance comparison across equation complexity levels for the Feynman and LSR-Transform datasets: (a) symbolic accuracy and (b) numeric precision ($\mathrm{Acc}_{0.1}$), showing a considerable performance gap between the two datasets at the same complexity levels (averaged over all method-LLM pairs).
  • Figure 5: Detailed results of in-domain (ID) and out-of-domain (OOD) performance using Normalized Mean Squared Error across various LSR-Synth scientific domains and LLM-based equation discovery methods (with GPT-4o-mini as LLM backbone).
  • ...and 16 more figures