Data-Informed Model Complexity Metric for Optimizing Symbolic Regression Models

Nathan Haut, Zenas Huang, Adam Alessio

TL;DR

The paper tackles the challenge of generalizable symbolic regression in GP by introducing a data-informed model complexity metric that aligns a model's effective dimensionality (ED) with the dataset's intrinsic dimensionality (ID). ED is estimated from the rank of an average Hessian computed at three representative points, while ID-based guidance derives target ED ranges from twelve estimators in scikit-dimension, enabling post-processing selection of models via ED within $[ID_{min}, ID_{max}]$. Across Penn ML Benchmark datasets with StackGP, the approach identifies an ideal complexity window that yields better generalization and reduces bias toward under- or over-complex solutions, despite the added computational cost of Hessian evaluation. The work demonstrates that three-point Hessian sampling can effectively approximate ED and that integrating ID estimates into selection improves Pareto-front quality, offering a practical pathway to more interpretable and reliable symbolic regression models.
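The post-processing selection step can be illustrated with a minimal sketch. It assumes that each evolved model already carries an ED estimate and an error score, and that a list of ID estimates for the dataset is available (in the paper these come from scikit-dimension; here they are made-up placeholder numbers). The `Candidate` class, the function name, and all values are hypothetical and are not taken from StackGP or the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str    # hypothetical identifier for an evolved model
    ed: float     # effective dimensionality estimate for the model
    error: float  # fitness to minimize, e.g. 1 - R^2


def select_in_id_window(candidates, id_estimates):
    """Keep models whose ED lies in [ID_min, ID_max], best error first."""
    id_min, id_max = min(id_estimates), max(id_estimates)
    kept = [c for c in candidates if id_min <= c.ed <= id_max]
    return sorted(kept, key=lambda c: c.error), (id_min, id_max)


# Placeholder ID estimates, standing in for the output of several estimators.
id_estimates = [1.8, 2.4, 3.1]

# Placeholder population: ED and error values are invented for illustration.
population = [
    Candidate("model_a", ed=1.0, error=0.40),  # below the window
    Candidate("model_b", ed=2.0, error=0.12),
    Candidate("model_c", ed=3.0, error=0.08),
    Candidate("model_d", ed=5.0, error=0.05),  # above the window
]

selected, window = select_in_id_window(population, id_estimates)
for c in selected:
    print(c.label, c.ed, c.error)
```

In this toy population the most accurate model, `model_d`, is excluded because its ED exceeds the largest ID estimate, which is the kind of over-complex solution the window is meant to filter out.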

Abstract

Choosing models that generalize beyond the training data from a well-fitted evolved population is difficult. We introduce a pragmatic method to estimate model complexity using Hessian rank for post-processing selection. Complexity is approximated by averaging the model output Hessian rank across a few points (N=3), offering efficient and accurate rank estimates. This method aligns model selection with input data complexity, calculated using intrinsic dimensionality (ID) estimators. Using the StackGP system, we develop symbolic regression models for the Penn Machine Learning Benchmark and employ twelve scikit-dimension library methods to estimate ID, aligning model expressiveness with dataset ID. Our data-informed complexity metric finds the ideal complexity window, balancing model expressiveness and accuracy and enhancing generalizability without the bias common in methods reliant on user-defined parameters, such as parsimony pressure in weight selection.
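A rough sketch of the Hessian-rank estimate described above follows. It is not the authors' implementation: it assumes a central finite-difference Hessian, a fixed numerical tolerance for the rank, and three randomly chosen data rows as the evaluation points, none of which are specified in this summary. Following the abstract's wording, the rank is computed at each point and then averaged. The function names and the toy model are hypothetical.

```python
import numpy as np


def numerical_hessian(f, x, h=1e-3):
    """Central-difference Hessian of a scalar function f at point x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            # Standard four-point stencil; it also covers the case i == j.
            H[i, j] = H[j, i] = (
                f(x + ei + ej) - f(x + ei - ej)
                - f(x - ei + ej) + f(x - ei - ej)
            ) / (4.0 * h * h)
    return H


def effective_dimensionality(f, X, n_points=3, tol=1e-6, seed=0):
    """Average Hessian rank over n_points rows of X (assumed sampling scheme)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_points, replace=False)
    ranks = [
        np.linalg.matrix_rank(numerical_hessian(f, X[i]), tol=tol)
        for i in idx
    ]
    return float(np.mean(ranks))


def toy_model(x):
    # Hypothetical evolved expression over 4 inputs; x[3] is unused.
    return x[0] * x[1] + x[2]


X = np.random.default_rng(1).uniform(-1.0, 1.0, size=(50, 4))
ed = effective_dimensionality(toy_model, X)
print(ed)
```

For this toy model the only nonzero second derivatives are the mixed partials in `x[0]` and `x[1]`, so the Hessian has rank 2 at every point and the averaged estimate is 2.0. The linear term in `x[2]` contributes nothing to the Hessian, which suggests that a purely Hessian-based measure may not count variables that enter a model only linearly.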

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Complexity Pressure Comparison. Shown here are the complexity pressures imposed by tournament selection and standard complexity metrics used in Pareto tournament selection and parsimony pressure compared to our proposed complexity metric. The plateau in our metric represents the target ID range since our approach does not assume a single point of optimal complexity.
  • Figure 2: ID estimates by algorithm across datasets with different numbers of input features, demonstrating the high variance in estimated ID among datasets with similar input feature counts.
  • Figure 3: Distribution of ID estimates for each method on all datasets, further highlighting the variance across methods.
  • Figure 4: Complexity distributions of a model population when using Pareto tournament selection, standard tournament selection (complexity & accuracy), and when incorporating ID as a third objective for Pareto tournament selection using the "auto_price" dataset.
  • Figure 5: Complexity vs Fitness ($1-R^2$) Histogram for Pareto Tournament Selection. Shown here is a histogram representing the distribution of models in a population after evolution with Pareto tournament selection. The results show that a significant portion of the population falls into the upper-left quadrant, which represents models that are overly simple and not very accurate.
  • ...and 3 more figures