In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery

Matteo Merler, Katsiaryna Haitsiukevich, Nicola Dainese, Pekka Marttinen

TL;DR

The paper introduces In-Context Symbolic Regression (ICSR), a framework in which a Large Language Model proposes seed functional forms and then iteratively refines them inside an optimization loop, while an external nonlinear least-squares optimizer fits the coefficients of each candidate. Candidates are scored with a complexity-regularized normalized mean squared error (NMSE) objective, so accurate but needlessly long expressions are penalized. ICSR matches or outperforms the best SR baselines on four benchmarks while producing simpler, more interpretable equations that generalize better out of distribution. The work shows that foundation models with a mathematical prior can contribute to SR without task-specific fine-tuning, and that they offer a flexible natural-language interface. Limitations include context-window constraints and scaling to higher-dimensional inputs; suggested future directions include domain-aware prompts, chain-of-thought reasoning, hybrid search strategies, and multimodal or hierarchical prompting.

Abstract

State-of-the-art Symbolic Regression (SR) methods currently build specialized models, while the application of Large Language Models (LLMs) remains largely unexplored. In this work, we introduce the first comprehensive framework that utilizes LLMs for the task of SR. We propose In-Context Symbolic Regression (ICSR), an SR method which iteratively refines a functional form with an LLM and determines its coefficients with an external optimizer. ICSR leverages LLMs' strong mathematical prior both to propose an initial set of possible functions given the observations and to refine them based on their errors. Our findings reveal that LLMs are able to successfully find symbolic equations that fit the given data, matching or outperforming the overall performance of the best SR baselines on four popular benchmarks, while yielding simpler equations with better out-of-distribution generalization.
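
The division of labor described above (the LLM proposes a functional form, an external optimizer fits its coefficients, and the result is scored with a complexity-aware error) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the skeleton `c0 * exp(c1 * x)` stands in for an LLM proposal, `scipy.optimize.curve_fit` stands in for the external nonlinear least-squares optimizer, NMSE is normalized by the variance of the targets, and the linear penalty `lam * complexity` is a hypothetical stand-in rather than the paper's objective.

```python
import numpy as np
from scipy.optimize import curve_fit


def skeleton(x, c0, c1):
    """Hypothetical functional form an LLM might propose: c0 * exp(c1 * x)."""
    return c0 * np.exp(c1 * x)


def nmse(y_true, y_pred):
    """Mean squared error normalized by the variance of the targets (one common definition)."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))


def penalized_score(y_true, y_pred, complexity, lam=0.01):
    """Illustrative objective (lower is better): NMSE plus a linear complexity penalty.

    The linear form and the default `lam` are assumptions, not the paper's formula.
    """
    return nmse(y_true, y_pred) + lam * complexity


# Synthetic, noise-free observations generated from y = 2 * exp(0.5 * x).
x = np.linspace(0.0, 2.0, 50)
y = 2.0 * np.exp(0.5 * x)

# Nonlinear least squares: fit the coefficients of the fixed skeleton.
coeffs, _ = curve_fit(skeleton, x, y, p0=[1.0, 1.0])
y_hat = skeleton(x, *coeffs)

# Complexity is a hand-counted number of expression-tree nodes: *, c0, exp, *, c1, x.
score = penalized_score(y, y_hat, complexity=6)
```

In a full loop, the LLM would be prompted again with the best-scoring candidates so far, and each new proposal would go through the same fit-and-score step.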

Paper Structure

This paper contains 36 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: High-level overview of the ICSR approach. Given an initial set of observations, we prompt the LLM to generate multiple initial guesses (seeds) of the true function that generated the observations. We then iteratively refine our guesses within an optimization loop where we propose new functions (based on a set of the previous best attempts), fit their coefficients and evaluate their fitness. The model only produces the functional form of a function, while the unknown coefficients are fitted using non-linear least squares optimization.
  • Figure 2: Comparison across baselines on out-of-distribution data. We compared the proposed method with the baselines by increasing the input domain for the generated functions. Negative $R^2$ values are clipped to 0 when computing the average (left figure), and the fraction of negative values is reported separately (right figure).
  • Figure 3: Out-of-distribution examples. Qualitative examples demonstrating the generalization capabilities of ICSR and uDSR on two experiments. The higher complexity of the uDSR examples introduces unnecessary terms that harm the out-of-distribution performance (area shaded in red).
  • Figure 4: Example of plots used with the VLM. (a) Scatter plot of the observations used when generating the seed functions. (b) Plot of the best function from a previous iteration used in the optimization loop.
  • Figure 5: Prompt used to generate the seed functions.
  • ...and 6 more figures
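
The aggregation described in the Figure 2 caption (negative $R^2$ values set to 0 for the average, with the fraction of negative values reported separately) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code; the function names and the example scores are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when the fit is worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def aggregate_r2(scores):
    """Return (mean with negative scores clipped to 0, fraction of negative scores)."""
    clipped = [max(s, 0.0) for s in scores]
    mean_clipped = sum(clipped) / len(scores)
    frac_negative = sum(1 for s in scores if s < 0) / len(scores)
    return mean_clipped, frac_negative


# Hypothetical per-function R^2 values on an enlarged input domain.
scores = [0.9, -2.5, 0.5, -0.1]
mean_clipped, frac_negative = aggregate_r2(scores)
```

Clipping keeps a few badly extrapolating functions from dominating the average, while the separate fraction records how often such failures occur.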