Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery

Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku

TL;DR

The paper argues that existing symbolic regression benchmarks are poorly suited to scientific discovery because their variables lack physical semantics and their sampling ranges are simplistic. It constructs 120 SRSD-Feynman datasets, plus another 120 that add dummy variables, and introduces a tree-based normalized edit distance (NED) to measure structural closeness between predicted and true expressions: $\overline{d}(f_{pred}, f_{true}) = \min\left(1, \frac{d\left(f_{pred}, f_{true}\right)}{|f_{true}|}\right)$, where $d$ is the edit distance between the two equation trees and $|f_{true}|$ is the number of nodes in the true equation tree. Large-scale experiments with baselines including uDSR, PySR, AFP, AIF, and DSR show that SRSD tasks are more challenging than prior SRBench problems and that $R^2$-based accuracy is vulnerable to dummy variables, whereas NED aligns better with human judgments (e.g., Pearson correlation coefficient $PCC$ for $R^2$ ≈ 0.913 with $p=4.66\times10^{-3}$ vs. $PCC$ for NED ≈ -0.416 with $p=1.85\times10^{-24}$). The authors release the SRSD datasets and code under open licenses to foster further research on symbolic regression for scientific discovery and to guide practitioners in method selection.
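To make the NED definition above concrete, here is a minimal Python sketch. It assumes unit costs for inserting, deleting, and relabeling a node, and it computes the ordered tree edit distance with a plain memoized recursion over forests. The tuple-based tree encoding and every function name are illustrative choices, not the authors' implementation.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple of
# trees; a forest is a tuple of trees. Tuples are hashable, so we can memoize.

def node(label, *children):
    """Build a tree node; call with no children to build a leaf."""
    return (label, tuple(children))

def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests with unit costs."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children take its place.
        _, g_kids = g[-1]
        return forest_distance(f, g[:-1] + g_kids) + 1
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, f_kids = f[-1]
        return forest_distance(f[:-1] + f_kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Map the two rightmost roots onto each other (cost 1 if labels differ),
    # then solve their child forests and the remaining forests separately.
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (f_label != g_label))
    return min(delete, insert, match)

def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))

def normalized_edit_distance(f_pred, f_true):
    """NED = min(1, d(f_pred, f_true) / |f_true|)."""
    return min(1.0, tree_edit_distance(f_pred, f_true) / size(f_true))

# True equation tree C * (X1 * X2) versus a prediction C * X1.
f_true = node("mul", node("C"), node("mul", node("X1"), node("X2")))
f_pred = node("mul", node("C"), node("X1"))
ned = normalized_edit_distance(f_pred, f_true)
```

In this example the prediction is two insertions away from the five-node true tree, so the NED is 2/5. The `min(1, ...)` cap matters only when the prediction is far larger than or unrelated to the true tree, in which case the raw ratio can exceed 1.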

Abstract

This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), focusing specifically on its potential for scientific discovery. Starting from a set of formulas used in existing datasets based on the Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values, so that our new SRSD datasets can be used to evaluate the potential of SRSD, such as whether an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can select only the necessary variables. In addition, we propose using normalized edit distances (NED) between the predicted and true equation trees to address a critical issue: existing SR metrics are either binary or measure errors between the target values and an SR model's predicted values for a given input. We conduct benchmark experiments on our new SRSD datasets using various representative SR methods. The experimental results show that our datasets provide a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric. We publish repositories of our code and 240 SRSD datasets.

Paper Structure (29 sections, 5 equations, 2 figures, 22 tables)


Figures (2)

  • Figure 1: Distribution map of the three subsets of our SRSD datasets with respect to our complexity metrics for SR problems. Data points at the top right/bottom left indicate more/less complex problems.
  • Figure 2: Example of preprocessing a true equation (III.7.38 in Table \ref{table:easy2}) in the evaluation session. When converting to an equation tree, we replace constant values and variables with specific symbols, e.g., $8.32647716907439 \times 10^{-33} \rightarrow C, \mu \rightarrow X_1, B \rightarrow X_2$.
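The symbol replacement described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not from the paper: trees are encoded as `(label, children)` tuples, numeric leaves are detected with `float()`, and variables are numbered in order of first appearance in a left-to-right traversal. The function names are hypothetical.

```python
def is_number(label):
    """Return True if a leaf label parses as a numeric constant."""
    try:
        float(label)
        return True
    except ValueError:
        return False

def symbolize(tree, var_names=None):
    """Replace constants with 'C' and variables with 'X1', 'X2', ...

    Assumption: variables are indexed by first appearance, left to right.
    Internal (operator) nodes are left unchanged.
    """
    if var_names is None:
        var_names = {}
    label, children = tree
    if children:
        return (label, tuple(symbolize(child, var_names) for child in children))
    if is_number(label):
        return ("C", ())
    if label not in var_names:
        var_names[label] = f"X{len(var_names) + 1}"
    return (var_names[label], ())

# A tree for 8.32647716907439e-33 * (mu * B), mirroring the caption's example.
raw_tree = ("mul", (("8.32647716907439e-33", ()),
                    ("mul", (("mu", ()), ("B", ())))))
symbolic_tree = symbolize(raw_tree)
```

Abstracting constants and variable names this way means a tree-based comparison depends on the structure of the equation rather than on the particular constant value or variable naming.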