Discovering equations from data: symbolic regression in dynamical systems

Beatriz R. Brum, Luiza Lober, Isolde Previdelli, Francisco A. Rodrigues

TL;DR

This work surveys the landscape of symbolic regression (SR) methods for uncovering governing equations from data and analyzes identifiability challenges in dynamical systems. It benchmarks six SR algorithms—GPLearn, AI-Feynman, PySINDy, PySR, PyKAN, and ODEFormer—across nine dynamical systems, including chaotic, oscillatory, predator–prey, and epidemiological models, with PySR delivering the most robust and accurate structural recovery. The results show that several methods can recover governing forms with high fidelity, though performance varies with noise, parameter choices, and system dimensionality; PySR generally dominates in both structural and predictive accuracy. The study underscores the potential of SR for real-world equation discovery while outlining practical limitations, such as identifiability, NP-hardness, and sensitivity to data quality, and it calls for expanded benchmarks and robust noise-handling strategies.

Abstract

The process of discovering equations from data lies at the heart of physics and of many other areas of research, including mathematical ecology and epidemiology. Recently, machine learning methods known as symbolic regression have emerged as a way to automate this task. This study presents an overview of the current literature on symbolic regression and compares the efficiency of five state-of-the-art methods in recovering the governing equations of nine processes, including chaotic dynamics and epidemic models. Benchmark results show PySR to be the most suitable method for inferring equations, with some estimates being indistinguishable from the original analytical forms. These results highlight the potential of symbolic regression as a robust tool for inferring and modeling real-world phenomena.
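
To make the task concrete, the following is a minimal sketch of recovering governing equations from simulated data. It is not the paper's benchmark pipeline and uses none of the benchmarked libraries: it applies sequentially thresholded least squares, the sparse-regression idea behind SINDy-type methods, with NumPy only. The Lotka-Volterra parameter values, the candidate-term library, the threshold, and all function names are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative Lotka-Volterra parameters (assumed values, not taken from the paper)
ALPHA, BETA, DELTA, GAMMA = 1.0, 0.5, 0.2, 0.8


def lotka_volterra(state):
    """Right-hand side: dx/dt = a*x - b*x*y, dy/dt = d*x*y - g*y."""
    x, y = state
    return np.array([ALPHA * x - BETA * x * y, DELTA * x * y - GAMMA * y])


def simulate(f, state0, dt, n_steps):
    """Integrate an autonomous ODE with the classical fourth-order Runge-Kutta scheme."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    for i in range(n_steps):
        s = traj[i]
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        traj[i + 1] = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj


def sparse_fit(library, targets, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    coefs = np.linalg.lstsq(library, targets, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coefs) < threshold
        coefs[small] = 0.0
        for j in range(targets.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                coefs[keep, j] = np.linalg.lstsq(
                    library[:, keep], targets[:, j], rcond=None
                )[0]
    return coefs


# Synthetic "measurements": a noise-free trajectory sampled every dt
dt = 0.001
traj = simulate(lotka_volterra, np.array([2.0, 1.0]), dt, 20000)

# Estimate derivatives from the data by finite differences; drop the
# less accurate one-sided estimates at both ends
derivs = np.gradient(traj, dt, axis=0)[1:-1]
x, y = traj[1:-1, 0], traj[1:-1, 1]

# Candidate terms the regression may combine (an assumed, hand-picked library)
names = ["1", "x", "y", "x*y", "x^2", "y^2"]
library = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])

coefs = sparse_fit(library, derivs)

# Print the recovered right-hand sides term by term
for j, lhs in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{coefs[i, j]:+.3f}*{names[i]}" for i in range(len(names)) if coefs[i, j] != 0.0]
    print(lhs, "=", " ".join(terms))
```

The library here is fixed in advance, which is what distinguishes this kind of sparse regression from genetic-programming approaches that search over expression structure as well as coefficients.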

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Diagram of the usual process for adjusting a genetic programming-based symbolic regression algorithm. Each method iterates new combinations of functions, tuning model complexity until the best fit is found.
  • Figure 2: Force diagram of the unitary non-linear pendulum.
  • Figure 3: Schematic of the compartmental transition for the selected epidemiological models, which use only parts of all compartments displayed. For their full equations, see the table of compartmental models.
  • Figure 4: Performance of the symbolic regression models in terms of $R^2$, with a black line displaying an average over all methods for a given system.
  • Figure 5: Resulting differences in $R^2$ when adding Gaussian noise to the synthetic data of the Lotka-Volterra (top) and SIR (bottom) systems.
  • ...and 1 more figure
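
Figures 4 and 5 report performance as $R^2$, including under added Gaussian noise. As a rough illustration of that metric, the sketch below computes the coefficient of determination and shows how it responds when zero-mean Gaussian noise is added to a clean signal. The sine signal, the noise levels, and the helper name `r2_score` are stand-ins chosen for this example; the paper's actual noise protocol is not described in this section.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
t = np.linspace(0.0, 10.0, 500)
clean = np.sin(t)  # stand-in for a model prediction (assumed signal)

# Treat the noisy series as the observations and the clean series as the
# prediction, then score the prediction at each noise level
scores = {}
for sigma in (0.0, 0.1, 0.5):  # assumed noise standard deviations
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    scores[sigma] = r2_score(noisy, clean)

for sigma, score in scores.items():
    print(f"sigma={sigma}: R^2={score:.3f}")
```

A score of 1 means the prediction matches the observations exactly; larger noise variance lowers the score even when the underlying model is correct, which is one reason noise sensitivity is reported separately from structural recovery.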