Symbolic Regression with a Learned Concept Library
Arya Grayeli, Atharva Sehgal, Omar Costilla-Reyes, Miles Cranmer, Swarat Chaudhuri
TL;DR
LaSR introduces a latent-concept library for symbolic regression (SR) that is learned and refined through zero-shot prompts to large language models (LLMs). By alternating between concept-guided hypothesis evolution, abstraction of patterns into concepts, and evolution of concepts themselves, LaSR combines evolutionary search with language-based priors to accelerate discovery of compact, interpretable equations. Empirically, LaSR outperforms state-of-the-art SR baselines on the Feynman equations and synthetic benchmarks, and its framework enables discovering novel LLM scaling laws from BigBench data. The work demonstrates that integrating a learned conceptual prior into SR can substantially improve search efficiency and interpretability, with potential extensions beyond SR and into broader scientific discovery tasks.
Abstract
We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.
