Symbolic Regression with a Learned Concept Library

Arya Grayeli; Atharva Sehgal; Omar Costilla-Reyes; Miles Cranmer; Swarat Chaudhuri

TL;DR

LaSR augments symbolic regression (SR) with a library of latent concepts that is learned and refined through zero-shot prompts to large language models (LLMs). By alternating between concept-guided hypothesis evolution, abstraction of patterns into concepts, and evolution of the concepts themselves, LaSR combines evolutionary search with language-based priors to accelerate the discovery of compact, interpretable equations. Empirically, LaSR outperforms state-of-the-art SR baselines on the Feynman equations and on synthetic benchmarks, and it can also be used to discover novel LLM scaling laws from BigBench data. The work demonstrates that integrating a learned conceptual prior into SR can substantially improve search efficiency and interpretability, with potential extensions beyond SR into broader scientific discovery tasks.

Abstract

We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.
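The alternation described above (hypothesis evolution, concept abstraction, concept evolution) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name here is hypothetical, hypotheses are reduced to a single slope `a` in `y = a * x`, and `llm_step` is a deterministic placeholder standing in for a zero-shot LLM query. The mixing probability `p` mirrors the probability of substituting an LLM-guided operation for a symbolic one.

```python
import random

# Toy data generated from y = 3x (an assumption for illustration only).
XS = [0.0, 1.0, 2.0, 3.0]
YS = [3.0 * x for x in XS]

def fitness(a):
    """Mean squared error of the hypothesis y = a * x (lower is better)."""
    return sum((a * x - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def symbolic_step(a, rng):
    """Stand-in for a standard evolutionary mutation."""
    return a + rng.gauss(0.0, 0.5)

def llm_step(a, concepts):
    """Placeholder for a zero-shot LLM query conditioned on `concepts`.
    Here it simply snaps the slope to the nearest integer."""
    return float(round(a))

def abstract_concepts(best):
    """Placeholder for LLM concept abstraction over top hypotheses."""
    return [f"slope is close to {round(a)}" for a in best]

def evolve_concepts(concepts):
    """Placeholder for LLM concept evolution; here, order-preserving dedupe."""
    return list(dict.fromkeys(concepts))

def lasr_sketch(init_population, iterations=5, p=0.3, top_k=2, seed=0):
    rng = random.Random(seed)
    population = list(init_population)
    concepts = []  # the concept library starts empty in this sketch
    for _ in range(iterations):
        # Phase 1: hypothesis evolution, mixing symbolic and LLM-guided steps.
        candidates = [
            llm_step(a, concepts) if rng.random() < p else symbolic_step(a, rng)
            for a in population
        ]
        # Elitist selection: keep the best hypotheses, population size fixed.
        population = sorted(population + candidates, key=fitness)[:len(population)]
        # Phase 2: abstract concepts from the top-performing hypotheses.
        concepts = concepts + abstract_concepts(population[:top_k])
        # Phase 3: evolve the concept library itself.
        concepts = evolve_concepts(concepts)
    return population, concepts

initial_population = [0.0, 1.0, 5.0, -2.0]
final_population, concepts = lasr_sketch(initial_population)
print(final_population[0], concepts)
```

In a real system the three placeholder functions would issue LLM queries and the hypotheses would be expression trees; the sketch only shows how the three phases feed into one another.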

Paper Structure (44 sections, 8 equations, 7 figures, 9 tables, 1 algorithm)

Introduction
Problem Formulation
Symbolic Regression.
Symbolic Regression with Latent Concept Libraries.
Method
Base Algorithm: PySR.
LLM-guided Hypothesis Evolution.
Concept Abstraction.
Concept Evolution.
Experiments
Comparison against baselines in the Feynman Equation Dataset
Cascading Experiments
Ablation Experiments
Qualitative Analysis and User Hints
Data Leakage Validation
...and 29 more sections

Figures (7)

Figure 1: An overview of LaSR. LaSR iteratively refines a library of interpretable textual concepts, which are used to bias the search for hypotheses in scientific discovery tasks. This involves three distinct phases: (Top) finding optimal hypotheses through concept-directed hypothesis evolution, (Right) leveraging the optimal hypotheses to find new concept abstractions, and (Left) iterating on learned concepts to discover new concepts that accelerate hypothesis evolution. LaSR introduces an orthogonal direction of improvement over current symbolic regression algorithms (Cranmer, 2023), shown in gray.
Figure 2: A single step of LaSR. LaSR induces multiple hypothesis populations that are evolved using a scalable evolutionary algorithm. Concept guidance is provided by randomly replacing symbolic operations with concept-directed LLM operations with probability $p$. After each iteration, the top-performing programs are summarized into natural language concepts, which are evolved to form new concepts that are sampled to guide the search in the next iteration.
Figure 3: Evaluation results for ablations/extensions of LaSR. (Left): We ablate three components of LaSR (Concept Evolution, the Concept Library, and variable names) and evaluate the MSE solve rate on the Feynman dataset over 40 iterations. We find that each component contributes to accelerating search at different stages of the search process. (Right): We extend LaSR by providing an initial concept library $\mathcal{C}_0$ in the form of user-provided hints. We find that natural language hints significantly increase the speed of solving equations.
Figure 4: LlmCrossover prompt with an example output. LlmMutation and LlmInit follow the same structure but with slightly different wording and with one and no reference expressions, respectively. Variables within double braces are replaced with the instance-specific arguments. These prompts are available in prompts/*.txt in the linked repository.
Figure 5: LLM Concept Abstraction prompt with an example output. The LLM Concept Crossover function follows a similar structure, with a modified task description for crossover on concepts.
...and 2 more figures
