Table of Contents
Fetching ...

Finetuning Large Language Model as an Effective Symbolic Regressor

Yingfan Hua, Ruikun Li, Jun Yao, Guohang Zhuang, Shixiang Tang, Bin Liu, Wanli Ouyang, Yan Lu

TL;DR

This benchmark comprises over 148,000 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference and proposes a heuristics metric to precisely quantify form-level consistency, going beyond existing SR numerical-oriented evaluation strategies.

Abstract

Deriving governing equations from observational data, known as Symbolic Regression (SR), is a cornerstone of scientific discovery. Large Language Models, (LLMs) have shown promise in this task by leveraging their vast cross-disciplinary scientific knowledge. However, existing LLM-based methods primarily rely on direct inference or prompt engineering, often requiring excessive inference iterations to converge on correct formulas or failing to treat complex equation targets. These limitations in effectiveness and generalization stem from an inherent tension between pre-trained LLMs' proficiency in approximate reasoning and the high-precision demands of SR tasks. To bridge this gap, we propose to fine-tune LLMs for enhanced SR capability. Yet, the absence of dedicated datasets for SR-oriented fine-tuning remains a critical barrier. We thus introduce SymbArena, specifically engineered to optimize LLMs for SR. This benchmark comprises over 148,000 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference. Further, to ensure a more comprehensive and fair evaluation, SymbArena proposes a heuristics metric to precisely quantify form-level consistency, going beyond existing SR numerical-oriented evaluation strategies. With this benchmark, we explore mainstream LLM fine-tuning techniques for SR tasks and establish Symbolic-R1, a simple yet effective LLM-based SR strong baseline. Experimental results validate Symbolic-R1 as the first LLM to exceed traditional numerical methods in both numerical precision and symbolic form accuracy, outperforming the second-best LLM baseline with improvements of 2-fold gains in R2 score and 10.3% in form-level consistency score.

Finetuning Large Language Model as an Effective Symbolic Regressor

TL;DR

This benchmark comprises over 148,000 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference and proposes a heuristics metric to precisely quantify form-level consistency, going beyond existing SR numerical-oriented evaluation strategies.

Abstract

Deriving governing equations from observational data, known as Symbolic Regression (SR), is a cornerstone of scientific discovery. Large Language Models, (LLMs) have shown promise in this task by leveraging their vast cross-disciplinary scientific knowledge. However, existing LLM-based methods primarily rely on direct inference or prompt engineering, often requiring excessive inference iterations to converge on correct formulas or failing to treat complex equation targets. These limitations in effectiveness and generalization stem from an inherent tension between pre-trained LLMs' proficiency in approximate reasoning and the high-precision demands of SR tasks. To bridge this gap, we propose to fine-tune LLMs for enhanced SR capability. Yet, the absence of dedicated datasets for SR-oriented fine-tuning remains a critical barrier. We thus introduce SymbArena, specifically engineered to optimize LLMs for SR. This benchmark comprises over 148,000 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference. Further, to ensure a more comprehensive and fair evaluation, SymbArena proposes a heuristics metric to precisely quantify form-level consistency, going beyond existing SR numerical-oriented evaluation strategies. With this benchmark, we explore mainstream LLM fine-tuning techniques for SR tasks and establish Symbolic-R1, a simple yet effective LLM-based SR strong baseline. Experimental results validate Symbolic-R1 as the first LLM to exceed traditional numerical methods in both numerical precision and symbolic form accuracy, outperforming the second-best LLM baseline with improvements of 2-fold gains in R2 score and 10.3% in form-level consistency score.

Paper Structure

This paper contains 32 sections, 6 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Comparison of our method against baselines on numerical accuracy $R^2$ and form-level consistency, showing both average performance (a, c) and a representative case study (b, d). The results reveal that the iterative approach of baseline methods is largely ineffective, with performance plateauing just above a random baseline. This suggests their process is closer to an exhaustive search than true inference. In contrast, our model achieves significant results in its first inference and reasons out the correct equation much faster (b, d). The substantial gains shown in (a) and (c) further confirm our model's high effectiveness and generalization.
  • Figure 2: Workflow of equation generation. (A) Define the symbolic space with a library of operators and terminals. (B) Apply structural constraints to select valid components. (C) Construct tree-based expressions incrementally. (D) Enforce validity and filter by complexity. (E) Check real-world applicability via similarity scoring.
  • Figure 3: Overview of the proposed Symbolic-R1 framework. It consists of a two-stage training phase (1. Instruction Tuning; 2. Reinforcement Tuning with Form-GRPO) followed by an inference phase, where a Hypothesis–Experiment–Revision loop is used to refine equations.