Integral Bayesian symbolic regression for optimal discovery of governing equations from scarce and noisy data
Oriol Cabanas-Tirapu, Sergio Cobo-Lopez, Savannah E. Sanchez, Forest L. Rohwer, Marta Sales-Pardo, Roger Guimerà
TL;DR
The paper addresses the discovery of governing differential equations from scarce and noisy time-series data without relying on derivative estimation. It introduces integral Bayesian symbolic regression (I-BMS), which scores candidate equations by their numerically integrated system trajectories $F_i$ and uses a description-length-based posterior to explore arbitrary symbolic forms via MCMC. On synthetic benchmarks (logistic and Lotka-Volterra), I-BMS recovers the ground-truth models across noise levels and degrees of data sparsity, and it is more robust and more accurate in prediction than FD-BMS, SD-BMS, and ensemble-SINDy variants. Applied to bacterial growth data, I-BMS yields a novel growth law $\frac{dB}{dt} = B r \left(1 + c_{0}\left(c_{1} B e^{c_{2} B}\right)^{B^{3}}\right)$ that outperforms the logistic and Gompertz models: it captures the lag, exponential, and saturation phases with a shorter description length and a lower RMSE, illustrating how the method can provide data-driven, interpretable insights into microbial physiology.
Abstract
Understanding how systems evolve over time often requires discovering the differential equations that govern their behavior. Automatically learning these equations from experimental data is challenging when the data are noisy or limited, and existing approaches struggle, in particular, with the estimation of unobserved derivatives. Here, we introduce an integral Bayesian symbolic regression method that learns governing equations directly from raw time-series data, without requiring manual assumptions or error-prone derivative estimation. By sampling the space of symbolic differential equations and evaluating them via numerical integration, our method robustly identifies governing equations even from noisy or scarce data. We show that this approach accurately recovers ground-truth models in synthetic benchmarks, and that it makes quasi-optimal predictions of system dynamics for all noise regimes. Applying this method to bacterial growth experiments across multiple species and substrates, we discover novel growth equations that outperform classical models in accurately capturing all phases of microbial proliferation, including lag, exponential, and saturation. Unlike standard approaches, our method reveals subtle shifts in growth dynamics, such as double ramp-ups or non-canonical transitions, offering a deeper, data-driven understanding of microbial physiology.
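To make the integral idea concrete, the following is a minimal sketch of scoring candidate differential equations by integrating them numerically and comparing the resulting trajectories with raw observations, so that no derivative is ever estimated from the data. It is not the authors' implementation: it omits the description-length posterior and the MCMC search over symbolic forms, and it swaps in a plain sum of squared errors with a coarse grid search over parameters. The helper names (`rk4_trajectory`, `integral_sse`, `best_fit`), the fixed-step Runge-Kutta integrator, the synthetic logistic data, the noise level, and the use of the first observation as the initial condition are all assumptions made for illustration.

```python
import random


def rk4_trajectory(rhs, x0, times, substeps=20):
    """Integrate dx/dt = rhs(x) with fixed-step RK4; return x at each time."""
    xs = [x0]
    x = x0
    for t0, t1 in zip(times[:-1], times[1:]):
        h = (t1 - t0) / substeps
        for _ in range(substeps):
            k1 = rhs(x)
            k2 = rhs(x + 0.5 * h * k1)
            k3 = rhs(x + 0.5 * h * k2)
            k4 = rhs(x + h * k3)
            x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        xs.append(x)
    return xs


def integral_sse(rhs, times, observed):
    """Squared error between the integrated trajectory and the raw data.

    Assumption: the first observation is used as the initial condition.
    """
    predicted = rk4_trajectory(rhs, observed[0], times)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def best_fit(make_rhs, param_grid, times, observed):
    """Coarse grid search (a stand-in for proper parameter fitting)."""
    return min(
        (integral_sse(make_rhs(*params), times, observed), params)
        for params in param_grid
    )


# Synthetic data (hypothetical): logistic growth with r = 1, K = 10, plus noise.
random.seed(0)
times = [float(t) for t in range(11)]
clean = rk4_trajectory(lambda x: 1.0 * x * (1 - x / 10.0), 0.5, times)
observed = [x + random.gauss(0.0, 0.1) for x in clean]

# Two candidate model families, each scored on its integrated trajectory.
r_values = [0.1 * i for i in range(1, 21)]
k_values = [float(k) for k in range(5, 16)]

sse_logistic, params_logistic = best_fit(
    lambda r, k: (lambda x: r * x * (1 - x / k)),
    [(r, k) for r in r_values for k in k_values],
    times,
    observed,
)
sse_exponential, params_exponential = best_fit(
    lambda r: (lambda x: r * x),
    [(r,) for r in r_values],
    times,
    observed,
)

print("logistic:   ", sse_logistic, params_logistic)
print("exponential:", sse_exponential, params_exponential)
```

In this toy setting the logistic candidate should fit the saturating data far better than the exponential one. A Bayesian treatment such as the one described in the abstract would additionally penalize model complexity, which a raw squared-error score does not do.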
