Data-Informed Model Complexity Metric for Optimizing Symbolic Regression Models

Nathan Haut; Zenas Huang; Adam Alessio

TL;DR

The paper addresses generalization in symbolic regression with genetic programming (GP) by introducing a data-informed model complexity metric that matches a model's effective dimensionality (ED) to the dataset's intrinsic dimensionality (ID). ED is estimated by averaging the rank of the model-output Hessian over three sample points. The target ED range comes from twelve ID estimators in the scikit-dimension library, so models can be selected in post-processing by requiring ED to fall within $[ID_{min}, ID_{max}]$. On Penn Machine Learning Benchmark datasets with models evolved by StackGP, the approach identifies a complexity window associated with better generalization and reduces bias toward under- or over-complex solutions, at the added computational cost of Hessian evaluation. The results indicate that three-point Hessian sampling approximates ED well and that using ID estimates in selection improves Pareto-front quality, offering a practical route to more interpretable and reliable symbolic regression models.
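The post-processing selection step can be illustrated with a small filter. This is only a sketch under stated assumptions: each candidate model is assumed to already carry an ED value and an error score, $ID_{min}$ and $ID_{max}$ are assumed to be the smallest and largest of the available ID estimates, and the names `Candidate`, `id_window`, and `select_in_window`, as well as every number, are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one evolved model."""
    name: str
    ed: float     # effective dimensionality of the model
    error: float  # error score, lower is better


def id_window(id_estimates):
    """Assumption: the window is the min and max of the ID estimates."""
    return min(id_estimates), max(id_estimates)


def select_in_window(candidates, id_estimates):
    """Keep only candidates whose ED lies inside [ID_min, ID_max]."""
    lo, hi = id_window(id_estimates)
    return [c for c in candidates if lo <= c.ed <= hi]


# Made-up ID estimates standing in for the outputs of several estimators.
id_estimates = [2.1, 2.8, 3.4]

# Made-up population of candidate models.
population = [
    Candidate("m1", ed=1.0, error=0.30),   # too simple for the data
    Candidate("m2", ed=2.67, error=0.12),
    Candidate("m3", ed=3.0, error=0.10),
    Candidate("m4", ed=5.0, error=0.09),   # more complex than the data
]

selected = select_in_window(population, id_estimates)
best = min(selected, key=lambda c: c.error)
print([c.name for c in selected], best.name)
```

In this toy population, `m4` has the lowest error but is excluded because its ED exceeds the window, which is the kind of over-complex solution the selection rule is meant to screen out.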

Abstract

Choosing models that generalize beyond the training data from a well-fitted evolved population is difficult. We introduce a pragmatic method that estimates model complexity from Hessian rank for use in post-processing selection. Complexity is approximated by averaging the rank of the model-output Hessian across a few points (N=3), which gives efficient and accurate rank estimates. The method aligns model selection with the complexity of the input data, measured with intrinsic dimensionality (ID) estimators. Using the StackGP system, we develop symbolic regression models for the Penn Machine Learning Benchmark and apply twelve methods from the scikit-dimension library to estimate ID, aligning model expressiveness with dataset ID. Our data-informed complexity metric finds the ideal complexity window, balancing model expressiveness and accuracy and improving generalizability without the bias common in methods that rely on user-defined parameters, such as parsimony pressure in weight selection.
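The Hessian-rank estimate can be sketched in a few lines. The version below is a minimal illustration, not the authors' implementation: it assumes the Hessian is obtained by central finite differences (the source does not say how it is computed), counts rank by Gaussian elimination with a fixed tolerance, and takes the three sample points as given. The function names, the tolerance, the toy model, and the points are all hypothetical.

```python
def hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at point x.

    Assumption: finite differences stand in for whatever differentiation
    the original method uses.
    """
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            val = (shifted(h, h) - shifted(h, -h)
                   - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
            H[i][j] = H[j][i] = val
    return H


def matrix_rank(M, tol=1e-4):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, rows):
            factor = A[r][col] / A[rank][col]
            for c in range(col, cols):
                A[r][c] -= factor * A[rank][c]
        rank += 1
    return rank


def effective_dimensionality(f, points):
    """Mean Hessian rank of f over the given sample points."""
    ranks = [matrix_rank(hessian(f, p)) for p in points]
    return sum(ranks) / len(ranks)


def model(x):
    """Toy model over four inputs; x[3] is unused."""
    return x[0] * x[1] + x[2] ** 2


# Three hypothetical sample points (N=3).
points = [
    [0.5, 1.0, -0.3, 2.0],
    [1.2, -0.7, 0.9, 0.1],
    [-1.5, 0.4, 1.1, -0.6],
]

ed = effective_dimensionality(model, points)
print(ed)
```

The toy model has four inputs but uses only three of them in a curved way, so its Hessian has rank 3 at each point and the averaged estimate is 3 rather than 4.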
