Scalable Sparse Regression for Model Discovery: The Fast Lane to Insight
Matthew Golden
TL;DR
The paper addresses scalable discovery of governing equations from data when symbolic libraries are large and coefficients may be very small. It introduces SPRINT, a fast extension of exhaustive sparse regression that is driven by rank-1 updates and uses bisection to identify optimal column modifications in the library matrix ${\bf G}$. It has two modes: SPRINT- (column removal) and SPRINT+ (column addition). The method relies on updating the smallest singular value via secular functions after each rank-1 change, yielding empirical computational scaling of roughly $O(|\mathcal{L}|^{3.38})$ for SPRINT- and $O(|\mathcal{L}|^{1.65})$ for SPRINT+, where $|\mathcal{L}|$ is the library size, together with a simple, largely hyperparameter-free model selection rule. Demonstrations on the Kuramoto–Sivashinsky equation and on MHD-like library sizes show that SPRINT+ can recover tiny coefficients (on the order of $10^{-6}$) and reproduce the same optimization elbow as exhaustive search, while offering orders-of-magnitude speedups for large symbolic libraries. The approach enables robust, interpretable model discovery in high-dimensional symbolic spaces and can be parallelized to further reduce computation, making data-driven discovery feasible for complex dynamical systems.
Abstract
There exist endless examples of dynamical systems with vast available data and unsatisfying mathematical descriptions. Sparse regression applied to symbolic libraries has quickly emerged as a powerful tool for learning governing equations directly from data; these learned equations balance quantitative accuracy with qualitative simplicity and human interpretability. Here, I present a general-purpose, model-agnostic sparse regression algorithm that extends a recently proposed exhaustive search leveraging iterative Singular Value Decompositions (SVD). This accelerated scheme, Scalable Pruning for Rapid Identification of Null vecTors (SPRINT), uses bisection with analytic bounds to quickly identify optimal rank-1 modifications to null vectors. It is intended to maintain sensitivity to small coefficients and to have reasonable computational cost for large symbolic libraries. A calculation that would take the age of the universe with an exhaustive search can be achieved in a day with SPRINT.
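To make the null-vector idea concrete, the following is a minimal sketch of a naive greedy column-removal search. It is not the SPRINT algorithm: it recomputes a full SVD for every candidate column, which is the cost that rank-1 updates and bisection are meant to avoid. It assumes the sparse model is the unit vector ${\bf c}$ minimizing $\|{\bf G}{\bf c}\|$ on a chosen set of columns, so that the residual is the smallest singular value; the function names `smallest_singular_pair` and `greedy_removal` and the synthetic test matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def smallest_singular_pair(G):
    """Return the smallest singular value of G and its right singular vector.

    Assumes G has at least as many rows as columns.
    """
    _, s, vt = np.linalg.svd(G, full_matrices=False)
    return s[-1], vt[-1]

def greedy_removal(G):
    """Naive backward elimination over the columns of G.

    At each step, record the current residual and coefficients, then drop
    the column whose removal leaves the smallest residual. Every candidate
    is scored with a fresh SVD (no rank-1 update, no bisection).
    """
    n_cols = G.shape[1]
    active = list(range(n_cols))
    path = []
    while active:
        sigma, v = smallest_singular_pair(G[:, active])
        coeffs = np.zeros(n_cols)
        coeffs[active] = v
        path.append((len(active), sigma, coeffs))
        if len(active) == 1:
            break

        def residual_without(j):
            keep = [k for k in active if k != j]
            return smallest_singular_pair(G[:, keep])[0]

        active.remove(min(active, key=residual_without))
    return path

# Synthetic library (illustrative only): six columns, with a planted
# relation col0 + 2*col1 - col2 = 0 up to tiny noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 6))
A[:, 2] = A[:, 0] + 2.0 * A[:, 1] + 1e-8 * rng.standard_normal(200)

path = greedy_removal(A)
for n_terms, sigma, _ in path:
    print(n_terms, sigma)
```

In this toy problem the residual stays near the noise level until only the three planted columns remain, then jumps sharply when a fourth column is removed. That jump is the kind of elbow in the residual-versus-sparsity curve that the model selection rule looks for.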
