Interpretable Scientific Discovery with Symbolic Regression: A Review
Nour Makke, Sanjay Chawla
TL;DR
This survey reviews symbolic regression (SR) with a focus on interpretability, contrasting it with traditional black-box models and highlighting the discrete, library-driven search space that SR navigates to uncover governing equations. It compares linear SR, nonlinear neural-symbolic SR (e.g., the Equation Learner), and tree-based approaches (genetic programming, transformers, reinforcement learning), detailing how each represents expressions and handles search and optimization. The review covers diverse applications and benchmarks (e.g., Feynman, Nguyen, SRBench), discusses current limitations such as library dependence and scaling, and argues that integrative approaches combining domain knowledge with powerful search architectures hold the most promise for data-driven scientific discovery. Overall, SR balances interpretability and predictive accuracy, with tree-based and physics- and mathematics-inspired methods offering the strongest performance, and real-data applications signaling significant future impact in the physical sciences and beyond.
Abstract
Symbolic regression is emerging as a promising machine learning method for learning succinct, interpretable underlying mathematical expressions directly from data. Although it has traditionally been tackled with genetic programming, it has recently attracted growing interest in the deep learning community as a data-driven model discovery method, achieving significant advances in application domains ranging from fundamental to applied sciences. This survey presents a structured and comprehensive overview of symbolic regression methods and discusses their strengths and limitations.
