A Transformer Model for Symbolic Regression towards Scientific Discovery
Florian Lalande, Yoshitomo Matsubara, Naoya Chiba, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku
TL;DR
This paper introduces Transformer-based symbolic regression (SR) models tailored for scientific discovery, aiming to keep the interpretability that SR offers over neural networks while avoiding the computational expense of typical SR algorithms. It develops three encoder variants (MLP, Att, Mix) and shows that a Mix-based encoder with label smoothing delivers strong generalization and near-instant inference on the SRSD benchmarks. Using a large synthetic training set and the SRSD evaluation framework, it achieves state-of-the-art performance on medium and hard SRSD problems as measured by the normalized tree edit distance, while acknowledging limitations on easy problems and in out-of-domain generalization. The work supports automated scientific discovery by enabling fast, interpretable equation discovery, and it provides open-source code for further research.
Abstract
Symbolic Regression (SR) searches for mathematical expressions that best describe numerical datasets. This circumvents the interpretation issues inherent to artificial neural networks, but SR algorithms are often computationally expensive. This work proposes a new Transformer model for Symbolic Regression, with a particular focus on its application to Scientific Discovery. We propose three encoder architectures of increasing flexibility, which comes at the cost of violating column-permutation equivariance. Training results indicate that the most flexible architecture is required to prevent overfitting. Once trained, we apply our best model to the SRSD datasets (Symbolic Regression for Scientific Discovery datasets), which yields state-of-the-art results as measured by the normalized tree-based edit distance, at no extra computational cost.
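To make the notion of column-permutation equivariance concrete, the following minimal sketch checks the property on a toy table. It assumes that rows are data points and columns are variables. The two encoders (`cellwise_encoder` and `position_weighted_encoder`) and the helper names are hypothetical and chosen for illustration only; they are not the MLP, Att, or Mix architectures proposed in this work.

```python
# Toy illustration only: these two "encoders" are hypothetical and are NOT
# the MLP/Att/Mix architectures described in the paper.

def permute_columns(table, perm):
    """Reorder the columns of a row-major table: new column j is old column perm[j]."""
    return [[row[j] for j in perm] for row in table]

def cellwise_encoder(table):
    """Apply the same scalar function to every cell, so column order plays no role."""
    return [[2.0 * x + 1.0 for x in row] for row in table]

def position_weighted_encoder(table):
    """Scale each column by a weight tied to its position, so column order matters."""
    weights = [1.0, 10.0, 100.0]  # assumed fixed weights, one per column position
    return [[w * x for w, x in zip(weights, row)] for row in table]

def is_column_equivariant(encoder, table, perm):
    """Check encoder(permute(X)) == permute(encoder(X)) for one table and one permutation."""
    return encoder(permute_columns(table, perm)) == permute_columns(encoder(table), perm)

table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 data points, 3 variables
perm = [2, 0, 1]

print(is_column_equivariant(cellwise_encoder, table, perm))           # True
print(is_column_equivariant(position_weighted_encoder, table, perm))  # False
```

The first encoder treats all columns identically, so reordering the input variables simply reorders its output. The second has more freedom, since it can treat each column position differently, but it loses the equivariance property. This is the kind of trade-off the abstract refers to.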
