Shape Constraints in Symbolic Regression using Penalized Least Squares

Viktor Martinek, Julia Reuter, Ophelia Frotscher, Sanaz Mostaghim, Markus Richter, Roland Herzog

TL;DR

The paper addresses how to incorporate shape constraints (SC) into symbolic regression (SR) by penalizing SC violations during the parameter identification step using gradient-based, second-order optimization. SC are evaluated at a finite set of points, and their violation penalties are combined with the SR loss in a soft-constrained framework implemented within TiSR, an NSGA-II-based platform. Experiments on the gaussian, magman, and vanderwaals problems under varying noise and data scarcity compare three variants: a baseline (base), a variant that uses SC only in the selection process (obj), and the proposed variant that minimizes SC violations during fitting (minim_obj). Results show that SC help most when data are limited. Compared to obj, minim_obj provides statistically significant gains in some test cases and is not significantly worse in any, which suggests practical utility for extrapolation and prior-knowledge integration in data-sparse regimes. The work points to future extensions to empirical datasets and broader SC types.

Abstract

We study the addition of shape constraints (SC) and their consideration during the parameter identification step of symbolic regression (SR). SC serve as a means to introduce prior knowledge about the shape of the otherwise unknown model function into SR. Unlike previous works that have explored SC in SR, we propose minimizing SC violations during parameter identification using gradient-based numerical optimization. We test three algorithm variants to evaluate their performance in identifying three symbolic expressions from synthetically generated data sets. This paper examines two benchmark scenarios: one with varying noise levels and another with reduced amounts of training data. The results indicate that incorporating SC into the expression search is particularly beneficial when data is scarce. Compared to using SC only in the selection process, our approach of minimizing violations during parameter identification shows a statistically significant benefit in some of our test cases, without being significantly worse in any instance.
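
The idea of penalizing SC violations at a finite set of points during parameter identification can be illustrated with a small sketch. This is not the authors' TiSR implementation: the candidate expression `a*x + b*x^2`, the monotonicity constraint `f'(x) >= 0`, the penalty weight `lam`, the constraint grid `xcs`, and the use of a hand-written Levenberg-Marquardt loop are all assumptions chosen for illustration. The sketch stacks the data residuals with weighted constraint violations and minimizes the sum of squares.

```python
import math

def model(x, a, b):
    """Hypothetical candidate expression f(x) = a*x + b*x^2."""
    return a * x + b * x * x

def model_dx(x, a, b):
    """Analytic derivative f'(x) = a + 2*b*x, used by the shape constraint."""
    return a + 2.0 * b * x

def residuals_and_jacobian(params, xs, ys, xcs, lam):
    """Stack data residuals with weighted violations of f'(xc) >= 0."""
    a, b = params
    w = math.sqrt(lam)
    res, jac = [], []
    for x, y in zip(xs, ys):
        res.append(model(x, a, b) - y)
        jac.append((x, x * x))
    for xc in xcs:
        violation = max(0.0, -model_dx(xc, a, b))
        res.append(w * violation)
        # A penalty row is only active where the constraint is violated.
        jac.append((-w, -2.0 * w * xc) if violation > 0.0 else (0.0, 0.0))
    return res, jac

def sum_sq(res):
    return sum(r * r for r in res)

def fit(xs, ys, xcs, lam, iters=200):
    """Levenberg-Marquardt on the stacked residuals (two parameters)."""
    params, mu = (0.0, 0.0), 1e-3
    res, jac = residuals_and_jacobian(params, xs, ys, xcs, lam)
    for _ in range(iters):
        # Damped normal equations (J^T J + mu*I) d = -J^T r, solved in closed form.
        h11 = sum(j[0] * j[0] for j in jac) + mu
        h12 = sum(j[0] * j[1] for j in jac)
        h22 = sum(j[1] * j[1] for j in jac) + mu
        g1 = sum(j[0] * r for j, r in zip(jac, res))
        g2 = sum(j[1] * r for j, r in zip(jac, res))
        det = h11 * h22 - h12 * h12
        d1 = (h12 * g2 - h22 * g1) / det
        d2 = (h12 * g1 - h11 * g2) / det
        if abs(d1) + abs(d2) < 1e-14:
            break
        trial = (params[0] + d1, params[1] + d2)
        t_res, t_jac = residuals_and_jacobian(trial, xs, ys, xcs, lam)
        if sum_sq(t_res) < sum_sq(res):
            # Accept the step and reduce the damping.
            params, res, jac = trial, t_res, t_jac
            mu = max(mu * 0.1, 1e-12)
        else:
            # Reject the step and increase the damping.
            mu *= 10.0
    return params

def max_violation(params, xcs):
    """Largest violation of f'(xc) >= 0 over the constraint points."""
    return max(max(0.0, -model_dx(xc, *params)) for xc in xcs)

# Scarce synthetic data on [0, 1]; the constraint is checked on [0, 3].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [model(x, 1.2, -0.4) for x in xs]
xcs = [0.25 * i for i in range(13)]

unconstrained = fit(xs, ys, xcs, lam=0.0)
penalized = fit(xs, ys, xcs, lam=10.0)
print("unconstrained:", unconstrained, max_violation(unconstrained, xcs))
print("penalized:    ", penalized, max_violation(penalized, xcs))
```

With `lam=0.0` the fit reproduces the data exactly but violates monotonicity outside the data range. With `lam>0` the fit trades some data misfit for a smaller violation, which is the soft-constraint behavior described above.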

Paper Structure

This paper contains 14 sections, 9 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 4.1: Simplified illustration of NSGA-II (left) and overview of the key differences between the three algorithm variants with regard to parameter identification, population selection, and hall of fame selection (right).
  • Figure 4.2: Plot of the gaussian expression for varying standard deviation $\sigma$ and random variable values $\theta$ (left) and the magman expression for varying distances $x$ and varying current $I$ (right).
  • Figure 4.3: Plot of the vanderwaals expression, where the pressure $p$ is shown for varying temperatures $T$ and specific volumes $v$ (left), as well as an illustrative example to aid understanding of the Maxwell criterion at $T = 400$. The boiling and dew points are shown and connected with a dashed red line (right).
  • Figure 4.4: Times out of 31 that each of the three algorithmic variants (base, obj, minim_obj) finds each of the three ground truth expressions (gaussian, magman, vanderwaals) for noise levels of 10, 30, and 35.
  • Figure 4.5: Times out of ten that each algorithmic variant finds the gaussian (upper plot) and the magman (lower plot) expressions at a noise level of 10 for different proportions of data out of 100 data points.
  • ...and 1 more figures