A Transformer Model for Symbolic Regression towards Scientific Discovery

Florian Lalande, Yoshitomo Matsubara, Naoya Chiba, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku

TL;DR

This paper introduces Transformer-based symbolic regression (SR) models tailored for scientific discovery, aiming to keep the interpretability of SR while avoiding the computational cost of conventional SR algorithms. It develops three encoder variants (MLP, Att, Mix) and shows that the Mix encoder with label-smoothing generalizes well and offers near-instant inference on SRSD benchmarks. Trained on a large synthetic dataset and evaluated within the SRSD framework, it achieves state-of-the-art performance on medium and hard SRSD problems as measured by the normalized tree edit distance, while acknowledging limitations on easy problems and in out-of-domain generalization. The work supports automated scientific discovery by enabling fast, interpretable equation discovery, and provides open-source code for further research.

Abstract

Symbolic Regression (SR) searches for mathematical expressions that best describe numerical datasets. This circumvents the interpretation issues inherent to artificial neural networks, but SR algorithms are often computationally expensive. This work proposes a new Transformer model for Symbolic Regression, with a particular focus on its application to Scientific Discovery. We propose three encoder architectures of increasing flexibility, at the cost of violating column-permutation equivariance. Training results indicate that the most flexible architecture is required to prevent overfitting. Once trained, we apply our best model to the SRSD datasets (Symbolic Regression for Scientific Discovery datasets), which yields state-of-the-art results on the normalized tree-based edit distance at no extra computational cost.
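
To make the evaluation metric concrete, the following is a minimal sketch of a normalized tree edit distance between two expression trees. It assumes the normalization min(1, d / |true tree|), where d is the ordered tree edit distance with unit costs; the tuple-based tree encoding and the names `forest_distance` and `normalized_ted` are hypothetical and are not taken from the paper's code. The simple memoized recursion is only suitable for small trees.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Ordered forest edit distance with unit insert/delete/relabel costs."""
    if not f:
        return size(g)
    if not g:
        return size(f)
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (v_label != w_label))
    return min(delete, insert, match)

def normalized_ted(pred, true):
    """Edit distance divided by the size of the true tree, capped at 1."""
    return min(1.0, forest_distance((pred,), (true,)) / size((true,)))

# true: x0 * x1        pred: x0 * sin(x1)
true = ("mul", (("x0", ()), ("x1", ())))
pred = ("mul", (("x0", ()), ("sin", (("x1", ()),))))
score = normalized_ted(pred, true)  # one edit (remove sin) over 3 true nodes
```

A score of 0 means the predicted tree matches the true tree exactly, and the cap at 1 keeps very different predictions from dominating an average.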

Paper Structure

This paper contains 12 sections, 4 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Architecture of our Transformer model for Symbolic Regression. We propose three encoder architectures: MLP, Att, or Mix. The decoder is a standard Transformer decoder and is the same in all cases. During training, the encoder receives the tabular dataset and the decoder receives the ground-truth token sequence via teacher forcing. During inference, the decoder predicts tokens on its own in an auto-regressive manner.
  • Figure 2: Loss function and token-wise accuracy during training without label-smoothing. The MLP encoder architecture strongly overfits the training set and cannot generalize to the validation/test sets. The Att encoder architecture generalizes to the validation/test sets to some extent but still shows some overfitting. The Mix architecture shows no sign of overfitting.
  • Figure 3: Loss function and token-wise accuracy during training with $\epsilon=0.1$ label-smoothing. The same observations as in Figure 2 apply. Label-smoothing does not resolve the overfitting problem.
  • Figure 4: Encoder architecture -- MLP version. This encoder architecture preserves row permutation invariance and column permutation equivariance. After MaxPooling, the features are tiled and concatenated to the original tensor. This architecture does not allow for much flexibility in feature design between variables.
  • Figure 5: Encoder architecture -- Att version. The self-attention mechanism preserves row permutation invariance and column permutation equivariance, and is more flexible in feature design.
  • ...and 1 more figure
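
The permutation properties described in the Figure 4 caption can be checked numerically. The sketch below is a simplified stand-in for the MLP encoder, not the paper's implementation: it assumes an input tensor of shape (rows, columns, features), a shared two-layer MLP applied to every cell, max-pooling over the row axis, and a final max over rows to obtain one vector per variable. The function names, layer sizes, and the choice to concatenate the tiled features to the MLP output are all assumptions made for illustration.

```python
import numpy as np

def mlp_block(x, w1, w2):
    """One encoder block on a (rows, cols, features) tensor."""
    # The same two-layer MLP (ReLU in between) is applied to every cell.
    h = np.maximum(x @ w1, 0.0) @ w2
    # Max-pool over the row axis, then tile the pooled features to every row.
    pooled = h.max(axis=0, keepdims=True)
    tiled = np.broadcast_to(pooled, h.shape)
    # Concatenate the tiled summary to the per-cell features.
    return np.concatenate([h, tiled], axis=-1)

def encode(x, w1, w2):
    # A final max over rows yields one feature vector per column (variable).
    return mlp_block(x, w1, w2).max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3, 4))   # 50 data points, 3 variables, 4 features
w1 = rng.normal(size=(4, 8))      # hypothetical layer sizes
w2 = rng.normal(size=(8, 8))

z = encode(x, w1, w2)             # shape (3, 16)

row_perm = rng.permutation(50)
col_perm = np.array([2, 0, 1])
z_rows = encode(x[row_perm], w1, w2)     # shuffle the data points
z_cols = encode(x[:, col_perm], w1, w2)  # shuffle the variables

print(np.allclose(z_rows, z))            # row permutation invariance
print(np.allclose(z_cols, z[col_perm]))  # column permutation equivariance
```

Because no operation mixes information across the column axis, each variable is encoded independently of the others, which illustrates the limited flexibility in feature design between variables that the caption mentions.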