Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery
Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku
TL;DR
The paper argues that existing symbolic regression (SR) benchmarks inadequately support scientific discovery because their variables lack physical semantics and their sampling ranges are simplistic. It constructs 120 SRSD-Feynman datasets plus another 120 that contain dummy variables, and introduces a tree-based normalized edit distance (NED) to measure structural closeness between predicted and true expressions, defined as $\overline{d}(f_{pred}, f_{true}) = \min\left(1, \frac{d\left(f_{pred}, f_{true}\right)}{|f_{true}|}\right)$, where $d$ is the edit distance between the two equation trees and $|f_{true}|$ is the number of nodes in the true equation tree. Large-scale experiments with baselines including $u$DSR, PySR, AFP, AIF, and DSR indicate that SRSD tasks are more challenging than prior SRBench problems and that $R^2$-based accuracy is vulnerable to dummy variables, whereas NED aligns better with human judgments in the user study (reported Pearson correlation coefficient, PCC, for $R^2$ ≈ 0.913 with $p=4.66\times10^{-3}$ vs. PCC for NED ≈ -0.416 with $p=1.85\times10^{-24}$; NED is a distance, so lower values mean closer expressions). The authors release the SRSD datasets and code under open licenses to foster ongoing research in symbolic regression for scientific discovery and to guide practitioners in method selection.
Abstract
This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), with a specific focus on its potential for scientific discovery. Starting from the set of formulas used in existing datasets based on the Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values, so that our new SRSD datasets can be used to evaluate the potential of SRSD, such as whether or not an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can select only the necessary variables. In addition, we propose to use normalized edit distances (NED) between the predicted and true equation trees to address a critical issue: existing SR metrics are either binary or measure only the error between the target values and an SR model's predicted values for a given input. We conduct benchmark experiments on our new SRSD datasets using various representative SR methods. The experimental results show that we provide a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric. We publish repositories of our code and 240 SRSD datasets.
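To make the normalization $\min\left(1, d / |f_{true}|\right)$ concrete, the following is a minimal sketch and not the authors' implementation. It assumes expressions are given as nested tuples `(operator, child, ...)` with string leaves, and it swaps in a plain Levenshtein distance over preorder token sequences as a simple stand-in for a true tree edit distance. The names `preorder`, `levenshtein`, and `normalized_edit_distance` are hypothetical helpers introduced only for illustration.

```python
from typing import Sequence, Union

# Assumed representation: a leaf is a string, an internal node is
# a tuple (operator, child, child, ...).
Tree = Union[str, tuple]


def preorder(tree: Tree) -> list:
    """Flatten an expression tree into its preorder token list."""
    if isinstance(tree, str):
        return [tree]
    op, *children = tree
    tokens = [op]
    for child in children:
        tokens.extend(preorder(child))
    return tokens


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost insert/delete/substitute distance between token sequences.

    This is a stand-in for a tree edit distance, not an equivalent of it.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(f_pred: Tree, f_true: Tree) -> float:
    """Edit distance divided by the node count of f_true, clipped at 1."""
    pred_tokens = preorder(f_pred)
    true_tokens = preorder(f_true)
    return min(1.0, levenshtein(pred_tokens, true_tokens) / len(true_tokens))


# Usage: true law F = m * a versus a prediction with a spurious extra term.
f_true = ("mul", "m", "a")
f_pred = ("mul", "m", ("add", "a", "c"))
print(normalized_edit_distance(f_pred, f_true))
```

In this toy case two extra tokens (`add` and `c`) separate the prediction from a three-node true tree, so the score is 2/3. Clipping at 1 keeps arbitrarily large wrong predictions from dominating an average over many datasets. A faithful reproduction would replace `levenshtein` with an ordered tree edit distance algorithm such as Zhang-Shasha.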
