Integral Bayesian symbolic regression for optimal discovery of governing equations from scarce and noisy data

Oriol Cabanas-Tirapu, Sergio Cobo-Lopez, Savannah E. Sanchez, Forest L. Rohwer, Marta Sales-Pardo, Roger Guimerà

TL;DR

The paper tackles discovering governing differential equations from scarce and noisy time-series data without relying on derivative estimation. It introduces integral Bayesian symbolic regression (I-BMS), which evaluates integrated system trajectories $F_i$ and uses a description-length-based posterior to explore arbitrary symbolic forms via MCMC. On synthetic benchmarks (logistic and Lotka-Volterra), I-BMS recovers ground-truth models across noise regimes and data sparsity, outperforming FD-BMS, SD-BMS, and ensemble-SINDy variants in robustness and predictive accuracy. Applying I-BMS to bacterial growth data yields a novel growth law $\frac{dB}{dt} = B r \left(1 + c_{0}\left(c_{1} B e^{c_{2} B}\right)^{B^{3}}\right)$ that outperforms logistic and Gompertz models, capturing lag, exponential, and saturation phases with improved description length and RMSE, illustrating data-driven, interpretable insights into microbial physiology.

Abstract

Understanding how systems evolve over time often requires discovering the differential equations that govern their behavior. Automatically learning these equations from experimental data is challenging when the data are noisy or limited, and existing approaches struggle, in particular, with the estimation of unobserved derivatives. Here, we introduce an integral Bayesian symbolic regression method that learns governing equations directly from raw time-series data, without requiring manual assumptions or error-prone derivative estimation. By sampling the space of symbolic differential equations and evaluating them via numerical integration, our method robustly identifies governing equations even from noisy or scarce data. We show that this approach accurately recovers ground-truth models in synthetic benchmarks, and that it makes quasi-optimal predictions of system dynamics for all noise regimes. Applying this method to bacterial growth experiments across multiple species and substrates, we discover novel growth equations that outperform classical models in accurately capturing all phases of microbial proliferation, including lag, exponential, and saturation. Unlike standard approaches, our method reveals subtle shifts in growth dynamics, such as double ramp-ups or non-canonical transitions, offering a deeper, data-driven understanding of microbial physiology.
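The core idea of scoring a candidate equation on its integrated trajectory, rather than on estimated derivatives, can be illustrated with a minimal sketch. This is not the authors' implementation: the RK4 integrator, the helper names (`rk4_trajectory`, `integral_sse`, `finite_difference_sse`), the parameter values, and the noise level are all assumptions chosen for illustration, and the parameters and initial condition are held fixed here instead of being optimized.

```python
import math
import random

def rk4_trajectory(f, x0, times, theta, substeps=20):
    """Integrate dx/dt = f(x, theta) with classical RK4; return x at each entry of `times`."""
    x, xs = x0, [x0]
    for t_prev, t_next in zip(times, times[1:]):
        h = (t_next - t_prev) / substeps
        for _ in range(substeps):
            k1 = f(x, theta)
            k2 = f(x + 0.5 * h * k1, theta)
            k3 = f(x + 0.5 * h * k2, theta)
            k4 = f(x + h * k3, theta)
            x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        xs.append(x)
    return xs

def logistic(x, theta):
    """Candidate right-hand side: logistic growth with rate r and capacity K."""
    r, K = theta
    return r * x * (1 - x / K)

def logistic_solution(t, x0, theta):
    """Closed-form logistic trajectory, used only to generate synthetic data."""
    r, K = theta
    return K / (1 + (K / x0 - 1) * math.exp(-r * t))

def integral_sse(f, x0, theta, times, observed):
    """Integral loss: squared error between the integrated trajectory and the raw data."""
    predicted = rk4_trajectory(f, x0, times, theta)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def finite_difference_sse(f, theta, times, observed):
    """Derivative-based loss: compares f(x) with forward finite differences of the data."""
    total = 0.0
    for i in range(len(times) - 1):
        dxdt = (observed[i + 1] - observed[i]) / (times[i + 1] - times[i])
        total += (f(observed[i], theta) - dxdt) ** 2
    return total

# Hypothetical synthetic dataset: 21 points, Gaussian noise with sigma = 0.02.
rng = random.Random(0)
times = [0.5 * i for i in range(21)]
true_theta, true_x0 = (1.0, 1.0), 0.05
wrong_theta = (0.5, 1.0)
clean = [logistic_solution(t, true_x0, true_theta) for t in times]
noisy = [x + rng.gauss(0.0, 0.02) for x in clean]

for name, theta in [("true", true_theta), ("wrong", wrong_theta)]:
    print(name,
          integral_sse(logistic, true_x0, theta, times, noisy),
          finite_difference_sse(logistic, theta, times, noisy))
```

In a full pipeline, a loss such as `integral_sse` would be minimized over both the parameters and the initial condition for every candidate expression before the candidates are compared; that optimization step is omitted here to keep the sketch short.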

Paper Structure

This paper contains 3 sections, 14 equations, 4 figures.

Figures (4)

  • Figure 1: Integral Bayesian symbolic regression and benchmark data. (A) Schematic representation of the approach. We start with scarce and noisy measured data $D$ from a system driven by the equation ${\dot{x}} = f^*(x, {\bm{\theta}})$. Given the observed data and any model $f$, we can evaluate the posterior probability $p(f|D)$ of the model without needing to numerically estimate the derivatives ${\dot{x}}$. This involves optimizing model parameters ${\bm{\theta}}$ and initial conditions $x_0$ on the integrated form, and calculating the description length ${\mathscr{L}}$ (see text). We select the model with the highest posterior (minimum description length), either by searching exhaustively within a predefined set of models or by sampling models through MCMC. (B-G) Left panels show the noisy synthetic data used in our validations, which we represent with gray lines; the black line corresponds to the noiseless ground-truth behavior. Right panels show the phase space (measured variable against its derivative), with green dots representing the finite-difference estimates of the derivative, yellow dots representing the smoothed estimates, and black lines representing the ground truth. (B-C) Logistic model. (D-G) Lotka-Volterra model. (B, D, F) Low-noise regime. (C, E, G) High-noise regime.
  • Figure 2: Validation on synthetic data. We generate synthetic data using the logistic and Lotka-Volterra models, with different levels of noise and different numbers of data points. We then exhaustively explore the space of all polynomial expressions (see "Exhaustive search of linear terms" in Methods) for $f_i({\mathbf{x}}, {\bm{\theta}})$, and use the BMS to select the model with the shortest description length. We do this for the integral BMS (I-BMS), as well as for the standard BMS with finite-difference-estimated derivatives (FD-BMS) and with smoothed derivatives (SD-BMS); and we benchmark these algorithms against ensemble SINDy (ESINDy) and weak ensemble SINDy (W-ESINDy), with the same library functions as the BMS. (A-C) Phase space trajectories predicted by the governing equations obtained using each of the approaches for one particular realization of the noise. Left panels correspond to the low-noise regime, while right panels correspond to the high-noise regime. (D-G) To quantify the ability of each approach to identify the true governing equation, we show the detection accuracy, that is, the fraction of times that the true governing equation is exactly recovered (each data point corresponds to an average over 40 datasets $D$): (D) as a function of noise level, for the logistic model and fixed number of observed points ($N = 120$); (E) as a function of the number of points, for the logistic model, fixed noise ($\sigma = 0.05$) and fixed total time range; (F) as a function of noise level, for the Lotka-Volterra model and fixed number of observed points ($N = 180$); (G) as a function of the number of points, for the Lotka-Volterra model, fixed noise ($\sigma = 3.5$) and fixed time spacing between consecutive observations.
  • Figure 3: Learnability and model predictive accuracy across noise levels. We generate synthetic data for the logistic and Lotka-Volterra models, as in Fig. 2. We then use MCMC to sample models from the posterior $p(f_i|D)$ (for each dataset, we run two independent MCMC processes with 3,000 steps each, except for the I-BMS on Lotka-Volterra data, for which we use 4,000 steps), and consider the most plausible model (equivalently, the model with the minimum description length). All points are averages over 40 datasets $D$. (A-B) Learnability as a function of noise level for the logistic and Lotka-Volterra datasets, respectively. (C-D) Root mean squared error (RMSE) between the ground truth data $x(t)$ and the predictions of the minimum description length model $x^e(t)$, normalized by the noise level $\sigma$, for the logistic and Lotka-Volterra datasets, respectively.
  • Figure 4: I-BMS model and reference growth models for bacterial growth. We show results for two bacteria-substrate pairs from the training set, and two from the test set. (A-D) Empirical growth curves and numerically integrated curves $x^{e}(t)$ for each model. (E-H) Derivative values plotted against different measured optical densities for each model. (I,K) Root mean squared error (RMSE) of the integrated curve relative to the observed data, computed for all datasets in the training and test sets, respectively. (J,L) Description length of the models for all training and test datasets, respectively.
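
The two evaluation metrics named in the captions of Figures 2 and 3, detection accuracy and RMSE normalized by the noise level $\sigma$, can be sketched as follows. The function names and the example inputs are hypothetical, and comparing equations as plain strings is a simplifying assumption: deciding whether two symbolic expressions are "exactly" the same would in practice require putting them in a canonical form first.

```python
import math

def detection_accuracy(recovered, truth):
    """Fraction of datasets whose selected equation matches the true one.

    Assumption: equations are already in a canonical string form, so plain
    string equality stands in for exact symbolic recovery.
    """
    return sum(1 for eq in recovered if eq == truth) / len(recovered)

def normalized_rmse(x_true, x_pred, sigma):
    """RMSE between ground-truth and predicted trajectories, divided by sigma."""
    mse = sum((a - b) ** 2 for a, b in zip(x_true, x_pred)) / len(x_true)
    return math.sqrt(mse) / sigma

# Hypothetical example: four datasets, three of which yield the true model.
recovered = ["r*x*(1-x/K)", "r*x*(1-x/K)", "r*x", "r*x*(1-x/K)"]
print(detection_accuracy(recovered, "r*x*(1-x/K)"))

# Hypothetical trajectories whose pointwise error equals the noise level.
print(normalized_rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.1], sigma=0.1))
```

A normalized RMSE close to 1 means the prediction error is comparable to the observation noise itself, which is why dividing by $\sigma$ makes results comparable across noise regimes.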