Current Challenges of Symbolic Regression: Optimization, Selection, Model Simplification, and Benchmarking

Guilherme Seidyo Imai Aldeia

TL;DR

This work analyzes four enduring challenges in Symbolic Regression (SR): parameter optimization, parent selection, model simplification, and benchmarking, all studied through a unified framework built around Brush. It shows that separating linear and non-linear parameter optimization can improve accuracy at the cost of runtime and model size, and it introduces a novel Minimum Variance Threshold variant of $\epsilon$-lexicase selection that balances convergence and diversity. A fast, data-driven, inexact simplification method based on Locality-Sensitive Hashing reduces bloat and yields simpler, often more accurate models. Complementing these contributions, the large-scale SR benchmark SRBench is updated to provide robust, diverse evaluations across both black-box and first-principles problems, clarifying current SR capabilities and guiding future research toward more interpretable, efficient, and domain-relevant SR methods.

Abstract

Symbolic Regression (SR) is a regression method that aims to discover mathematical expressions describing the relationship between variables. It is often implemented through Genetic Programming, a search technique inspired by biological evolution. Its appeal lies in combining predictive accuracy with interpretable models, but its promise is limited by several long-standing challenges: parameters are difficult to optimize, the way parent solutions are selected affects the course of the search, and models often grow unnecessarily complex. In addition, methods must be re-evaluated regularly to keep an accurate picture of the SR landscape. This thesis addresses these challenges through a sequence of studies conducted throughout the doctorate, each focusing on an important aspect of the SR search process. First, I investigate parameter optimization, obtaining insights into its role in improving predictive accuracy, albeit with trade-offs in runtime and expression size. Next, I study parent selection, exploring $\epsilon$-lexicase to select parents that are more likely to generate well-performing offspring. The focus then turns to simplification, where I introduce a novel method based on memoization and locality-sensitive hashing that reduces redundancy and yields simpler, more accurate models. All of these contributions are implemented in a multi-objective evolutionary SR library, which achieves Pareto-optimal performance in terms of accuracy and simplicity on benchmarks of real-world and synthetic problems, outperforming several contemporary SR approaches. The thesis concludes by proposing changes to a widely used large-scale symbolic regression benchmark suite and then running the updated experiments to assess the symbolic regression landscape. The results show that an SR method incorporating the contributions presented in this thesis achieves Pareto-optimal performance.
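For readers unfamiliar with the parent selection step mentioned above, the following is a minimal sketch of the commonly described baseline form of $\epsilon$-lexicase selection, in which the tolerance for each training case is the median absolute deviation (MAD) of the population's errors on that case. It is not the Minimum Variance Threshold variant proposed in the thesis, whose details are not given in this abstract, and it is not taken from the Brush code base. The function names `mad` and `epsilon_lexicase_select` and the list-of-lists error layout are assumptions made for illustration.

```python
import random
import statistics


def mad(values):
    """Median absolute deviation of a list of numbers."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def epsilon_lexicase_select(errors, rng=random):
    """Select one parent index (illustrative sketch, not the thesis variant).

    errors[i][c] is the absolute error of individual i on training case c.
    """
    n_cases = len(errors[0])
    # One epsilon per case, computed over the whole population.
    epsilons = [mad([ind[c] for ind in errors]) for c in range(n_cases)]

    pool = list(range(len(errors)))
    cases = list(range(n_cases))
    rng.shuffle(cases)  # each selection event sees the cases in a new order

    for c in cases:
        if len(pool) == 1:
            break
        best = min(errors[i][c] for i in pool)
        # Keep every individual within epsilon of the best error on this case.
        pool = [i for i in pool if errors[i][c] <= best + epsilons[c]]

    # If several individuals survive all cases, pick one at random.
    return rng.choice(pool)
```

Calling `epsilon_lexicase_select(errors)` once per required parent yields the parent set for one generation. Because the case order is reshuffled on every call, individuals that are strong on different subsets of the data can each be selected, which is the usual explanation for why this family of methods tends to preserve diversity.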
