Minimum variance threshold for epsilon-lexicase selection

Guilherme Seidyo Imai Aldeia, Fabricio Olivetti de Franca, William G. La Cava

TL;DR

This work replaces the MAD-based threshold in epsilon-lexicase parent selection for symbolic regression with a Minimum Variance Threshold, which splits the errors on a test case into two clusters so that within-partition variance is minimized. The aim is to reduce the information lost when parents are selected. Implemented in FEAT and evaluated on SRBench across real-world and synthetic tasks, the approach (notably the dynamic D-Split variant) yields competitive or improved real-world performance while maintaining model complexity. The results suggest that clustering-based thresholding can better capture performance differences across test cases, though it increases the number of test evaluations and the runtime. The method offers a principled alternative threshold for selection in symbolic regression and genetic programming (SR/GP) pipelines and motivates further work on down-sampling and broader applicability.

Abstract

Parent selection plays an important role in evolutionary algorithms, and many strategies exist to select the parent pool before breeding the next generation. Methods often rely on average error over the entire dataset as a criterion to select the parents, which can lead to information loss because all test cases are aggregated. Under epsilon-lexicase selection, the population goes to a selection pool that is iteratively reduced by using each test case individually, discarding individuals with an error higher than the elite error plus the median absolute deviation (MAD) of errors for that particular test case. In an attempt to better capture differences in the performance of individuals on cases, we propose a new criterion that splits errors into two partitions that minimize the total variance within partitions. Our method was embedded into the FEAT symbolic regression algorithm and evaluated with the SRBench framework, containing 122 black-box synthetic and real-world regression problems. The empirical results show better performance of our approach compared to traditional epsilon-lexicase selection on the real-world datasets and equivalent performance on the synthetic datasets.
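
The two per-case thresholds contrasted in the abstract can be illustrated with a small Python sketch. This is not the FEAT implementation. The function names are hypothetical, and the objective is assumed to be the sum of squared deviations within each partition, with the threshold taken as the largest error in the low partition. The paper's own equation may weight or normalize the variances differently.

```python
from statistics import median


def mad_threshold(errors):
    """Classic epsilon-lexicase cutoff: elite error plus the MAD of the errors."""
    med = median(errors)
    mad = median([abs(e - med) for e in errors])
    return min(errors) + mad


def sum_sq_dev(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def min_variance_threshold(errors):
    """Cutoff that splits the errors into a low and a high partition with
    the smallest total within-partition squared deviation.

    Assumption: the cutoff is the largest error in the low partition, so
    individuals with error <= cutoff survive.
    """
    ordered = sorted(errors)
    # Start from "no split": every individual survives.
    best_tau, best_cost = ordered[-1], sum_sq_dev(ordered)
    for i in range(1, len(ordered)):
        if ordered[i] == ordered[i - 1]:
            continue  # equal errors cannot be separated by a threshold
        cost = sum_sq_dev(ordered[:i]) + sum_sq_dev(ordered[i:])
        if cost < best_cost:
            best_tau, best_cost = ordered[i - 1], cost
    return best_tau


# Toy errors of six individuals on one test case (made-up numbers).
errors = [0.10, 0.11, 0.12, 0.13, 0.14, 2.0]
for name, fn in [("MAD", mad_threshold), ("split", min_variance_threshold)]:
    tau = fn(errors)
    survivors = [e for e in errors if e <= tau]
    print(name, tau, len(survivors))
```

On these made-up errors the MAD cutoff keeps two individuals, while the split keeps the five that form the low-error cluster and discards only the outlier.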

Paper Structure

This paper contains 11 sections, 4 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Process of consecutively splitting the pool of individuals into two clusters based on their error ($x$ axis). Each step randomly picks a test case, estimates $\tau^*$ by solving Eq. \ref{eq:split_threshold}, and removes individuals whose error is higher than that threshold. This is repeated until only one individual remains in the pool or all training data has been used as individual test cases, in which case one random individual is returned from the remaining pool.
  • Figure 2: An individual in FEAT is a collection of symbolic regression trees as meta-features for any machine learning model.
  • Figure 3: Dimensionality of the SRBench datasets.
  • Figure 4: Convergence loss of the best individual on validation partition for the six problems.
  • Figure 5: Median number of test cases used to pick each parent for the six problems.
  • ...and 8 more figures
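
The selection loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration rather than the FEAT code. The names `split_threshold` and `select_parent` and the `error_matrix` layout are hypothetical, and the threshold is assumed to minimize the summed within-partition squared deviations, computed here with running sums.

```python
import random


def split_threshold(errors):
    """Largest error of the low partition in the two-way split with the
    smallest total within-partition sum of squares (assumed objective)."""
    xs = sorted(errors)
    n = len(xs)
    total, total_sq = sum(xs), sum(x * x for x in xs)
    best_tau, best_cost = xs[-1], float("inf")  # default: keep everyone
    left, left_sq = 0.0, 0.0
    for i in range(1, n):
        left += xs[i - 1]
        left_sq += xs[i - 1] ** 2
        if xs[i] == xs[i - 1]:
            continue  # ties cannot be separated by a threshold
        right, right_sq = total - left, total_sq - left_sq
        # Sum of squares about the mean: sum(x^2) - (sum x)^2 / count
        cost = (left_sq - left * left / i) + (right_sq - right * right / (n - i))
        if cost < best_cost:
            best_tau, best_cost = xs[i - 1], cost
    return best_tau


def select_parent(error_matrix, rng=random):
    """Pick one parent; error_matrix[i][j] is the error of individual i
    on test case j."""
    pool = list(range(len(error_matrix)))
    cases = list(range(len(error_matrix[0])))
    rng.shuffle(cases)  # visit test cases in random order
    for case in cases:
        if len(pool) == 1:
            break
        tau = split_threshold([error_matrix[i][case] for i in pool])
        pool = [i for i in pool if error_matrix[i][case] <= tau]
    # Either one individual is left or all cases were used.
    return rng.choice(pool)


# Three individuals, three test cases (made-up errors).
errors = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.1, 0.9, 1.2],
]
print(select_parent(errors, random.Random(0)))
```

Because the threshold is always one of the observed errors, the pool can never become empty. When every individual in the pool has the same error on a case, the whole pool survives that case.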