Current Challenges of Symbolic Regression: Optimization, Selection, Model Simplification, and Benchmarking

Guilherme Seidyo Imai Aldeia

TL;DR

This work analyzes four enduring challenges in Symbolic Regression—parameter optimization, parent selection, model simplification, and benchmarking—through a unified framework built around Brush. It shows that separating linear and non-linear parameter optimization can improve accuracy at the cost of runtime and model size, and introduces a novel Minimum Variance Threshold variant for epsilon-lexicase selection that balances convergence and diversity. A fast, data-driven inexact simplification method using Locality-Sensitive Hashing reduces bloating and yields simpler, often more accurate models. Complementing these contributions, a large-scale SR benchmark (SRBench) is updated to provide robust, diverse evaluations across both black-box and first-principles problems, clarifying current SR capabilities and guiding future research toward more interpretable, efficient, and domain-relevant SR methods.

Abstract

Symbolic Regression (SR) is a regression method that aims to discover mathematical expressions that describe the relationship between variables, and it is often implemented through Genetic Programming, a metaphor for the process of biological evolution. Its appeal lies in combining predictive accuracy with interpretable models, but its promise is limited by several long-standing challenges: parameters are difficult to optimize, the selection of solutions can affect the search, and models often grow unnecessarily complex. In addition, current methods must be constantly re-evaluated to understand the SR landscape. This thesis addresses these challenges through a sequence of studies conducted throughout the doctorate, each focusing on an important aspect of the SR search process. First, I investigate parameter optimization, obtaining insights into its role in improving predictive accuracy, albeit with trade-offs in runtime and expression size. Next, I study parent selection, exploring $ε$-lexicase to select parents more likely to generate well-performing offspring. The focus then turns to simplification, where I introduce a novel method based on memoization and locality-sensitive hashing that reduces redundancy and yields simpler, more accurate models. All of these contributions are implemented in a multi-objective evolutionary SR library, which achieves Pareto-optimal performance in terms of accuracy and simplicity on benchmarks of real-world and synthetic problems, outperforming several contemporary SR approaches. The thesis concludes by proposing changes to a well-known large-scale symbolic regression benchmark suite, then running the experiments to assess the symbolic regression landscape, demonstrating that an SR method with the contributions presented in this thesis achieves Pareto-optimal performance.

Paper Structure

This paper contains 109 sections, 33 equations, 46 figures, 20 tables, 11 algorithms.

Figures (46)

  • Figure 2.1: Evaluation of a parse tree (left) and a single node (right). The dashed arrows traverse the tree top-down. Each time a node with children is encountered, its children are evaluated first (i.e., through recursive calls). Once their values are returned (continuous arrows), the node's operator is applied to the children's returned values.
  • Figure 2.2: Illustration of a Pareto front (dotted line). The Pareto front consists of the set of non-dominated individuals in a multi-objective setting. Final solutions are usually chosen from this set. A dominated individual is no better on any objective, and strictly worse on at least one, when compared with at least one solution from the Pareto front.
  • Figure 2.3: Flowchart of common steps of evolutionary algorithms. Each iteration corresponds to a generation. Individuals are randomly generated and then iteratively refined through a process that resembles a random search, but with selection and survival steps helping to guide the search towards promising regions of the hypothesis space.
  • Figure 2.4: Roulette selection, a parent selection mechanism proposed by John Holland, where probabilities are proportional to fitness. Each parent is selected by spinning the wheel.
  • Figure 2.5: Tournament selection, where each parent is selected by winning a tournament of $k$ individuals. This is the most commonly used parent selection scheme today. Candidates are randomly assigned to the tournament with uniform probabilities, and the best-performing one is selected.
  • ...and 41 more figures
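
The captions for Figures 2.1, 2.2, and 2.5 describe three small procedures: recursive parse-tree evaluation, extraction of the non-dominated set, and tournament selection. The following is a minimal Python sketch of these ideas, not the thesis's or Brush's implementation. The tuple-based tree encoding, the `OPS` table, and all function names are hypothetical and chosen only for illustration. The sketch assumes all objectives are minimized and uses the usual definition of dominance (no worse on every objective, strictly better on at least one).

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical encoding: a leaf is a variable name (str) or a constant (float);
# an internal node is a tuple (operator_name, child_1, child_2, ...).
OPS: Dict[str, Callable[..., float]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def evaluate(node, env: Dict[str, float]) -> float:
    """Evaluate a parse tree top-down: children first, then the node's operator."""
    if isinstance(node, tuple):
        op, *children = node
        values = [evaluate(child, env) for child in children]  # recursive calls
        return OPS[op](*values)
    if isinstance(node, str):
        return env[node]  # variable lookup
    return float(node)  # numeric constant


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the non-dominated points (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def tournament_select(population: list, error: Callable, k: int,
                      rng: random.Random):
    """Draw k distinct candidates uniformly and return the lowest-error one."""
    contenders = rng.sample(population, k)
    return min(contenders, key=error)


if __name__ == "__main__":
    tree = ("add", ("mul", "x", 2.0), 1.0)  # encodes x * 2 + 1
    print(evaluate(tree, {"x": 3.0}))
    # Objectives here are (error, size), both minimized.
    print(pareto_front([(1, 5), (2, 2), (3, 3), (5, 1)]))
    print(tournament_select([3, 1, 2], lambda e: e, k=2, rng=random.Random(0)))
```

The quadratic-time front extraction is adequate for small populations. Larger populations typically use a non-dominated sorting routine instead, which is outside the scope of this sketch.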