Shape Arithmetic Expressions: Advancing Scientific Discovery Beyond Closed-Form Equations
Krzysztof Kacprzyk, Mihaela van der Schaar
TL;DR
The paper addresses two limitations: symbolic regression (SR) often cannot yield compact, interpretable expressions for relationships that have no closed form, while Generalized Additive Models (GAMs) miss intricate feature interactions. It proposes Shape Arithmetic Expressions (SHAREs), a unifying model class that combines GAM-like univariate shape functions with interaction-capable expression trees, constrained by a set of rules designed to guarantee transparency. The authors formalize SHAREs, establish theoretical properties relating to expression size and depth, and show in experiments (including a torque problem and a temperature problem) that SHAREs can outperform both SR and GAMs while preserving interpretability. The work advances AI4Science by enabling transparent, interaction-aware discovery of scientific relationships from data, with potential applicability across physics, biology, and engineering.
Abstract
Symbolic regression has excelled in uncovering equations from physics, chemistry, biology, and related disciplines. However, its effectiveness becomes less certain when applied to experimental data lacking inherent closed-form expressions. Empirically derived relationships, such as entire stress-strain curves, may defy concise closed-form representation, compelling us to explore more adaptive modeling approaches that balance flexibility with interpretability. To this end, we turn to Generalized Additive Models (GAMs), a widely used class of models known for their versatility across various domains. Although GAMs can capture non-linear relationships between variables and targets, they cannot capture intricate feature interactions. In this work, we investigate both of these challenges and propose a novel class of models, Shape Arithmetic Expressions (SHAREs), that fuses GAMs' flexible shape functions with the complex feature interactions found in mathematical expressions. SHAREs also provide a unifying framework for both of these approaches. In addition, we design a set of rules for constructing SHAREs that guarantee the transparency of the found expressions beyond the standard constraints based on the model's size.
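
To make the contrast between an additive model and an expression with interacting shape functions concrete, here is a minimal sketch. It is not the authors' implementation: the piecewise-linear shape functions, the knot values, and the names make_shape, gam, share_like, and mixed_difference are all hypothetical, chosen only for illustration. It relies on the fact that, for any purely additive model g(x1) + h(x2), the mixed difference f(a,b) - f(a,d) - f(c,b) + f(c,d) is exactly zero, whereas an expression that multiplies two shape functions generally gives a nonzero value.

```python
import numpy as np

def make_shape(knots_x, knots_y):
    """Return a univariate piecewise-linear shape function (illustrative stand-in)."""
    kx = np.asarray(knots_x, dtype=float)
    ky = np.asarray(knots_y, dtype=float)
    return lambda x: np.interp(x, kx, ky)

# Two hypothetical shape functions with made-up knots.
s1 = make_shape([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
s2 = make_shape([0.0, 1.0, 2.0], [1.0, 0.5, 0.0])

def gam(x1, x2):
    # GAM-style model: shape functions are only ever added.
    return s1(x1) + s2(x2)

def share_like(x1, x2):
    # Expression combining shape functions with an arithmetic
    # operation other than addition (here, a product).
    return s1(x1) * s2(x2)

def mixed_difference(f, a, b, c, d):
    # Zero for every additive f; nonzero signals a feature interaction.
    return f(a, b) - f(a, d) - f(c, b) + f(c, d)

gam_interaction = mixed_difference(gam, 0.5, 0.5, 1.5, 1.5)
share_interaction = mixed_difference(share_like, 0.5, 0.5, 1.5, 1.5)
print(gam_interaction, share_interaction)
```

In this toy setting the additive model shows no interaction, while the product of the same two shape functions does, which is the kind of structure a GAM alone cannot express.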
