Ill-Conditioning in Dictionary-Based Dynamic-Equation Learning: A Systems Biology Case Study

Yuxiang Feng, Niall M Mangan, Manu Jayadharan

Abstract

Data-driven discovery of governing equations from time-series data provides a powerful framework for understanding complex biological systems. Library-based approaches that use sparse regression over candidate functions have shown considerable promise, but they face a critical challenge when candidate functions become strongly correlated: numerical ill-conditioning. Poor or restricted sampling, together with particular choices of candidate libraries, can produce strong multicollinearity and numerical instability. In such cases, measurement noise may lead to widely different recovered models, obscuring the true underlying dynamics and hindering accurate system identification. Although sparse regularization promotes parsimonious solutions and can partially mitigate conditioning issues, strong correlations may persist, regularization may bias the recovered models, and the regression problem may remain highly sensitive to small perturbations in the data. We present a systematic analysis of how ill-conditioning affects sparse identification of biological dynamics using benchmark models from systems biology. We show that combinations involving as few as two or three terms can already exhibit strong multicollinearity and extremely large condition numbers. We further show that orthogonal polynomial bases do not consistently resolve ill-conditioning and can perform worse than monomial libraries when the data distribution deviates from the weight function associated with the orthogonal basis. Finally, we demonstrate that when data are sampled from distributions aligned with the appropriate weight functions corresponding to the orthogonal basis, numerical conditioning improves, and orthogonal polynomial bases can yield improved model recovery accuracy across two baseline models.
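The multicollinearity described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the sampling range, the $R^2$ threshold of 0.95, and the helper names `monomial_library` and `count_correlated_pairs` are all assumptions made for illustration. For a pair of terms, the $R^2$ of regressing one column on the other (with intercept) equals their squared Pearson correlation, which is what the sketch counts.

```python
import itertools
import numpy as np

def monomial_library(x, degree):
    """Hypothetical helper: columns x^1 ... x^degree (constant column omitted)."""
    return np.column_stack([x**k for k in range(1, degree + 1)])

def count_correlated_pairs(theta, r2_threshold):
    """Count column pairs whose squared Pearson correlation exceeds the threshold."""
    corr = np.corrcoef(theta, rowvar=False)
    pairs = itertools.combinations(range(theta.shape[1]), 2)
    return int(sum(corr[i, j] ** 2 > r2_threshold for i, j in pairs))

rng = np.random.default_rng(0)
# Assumed stand-in for measured data: positive states confined to a narrow range.
x = rng.uniform(1.0, 3.0, size=2000)
theta = monomial_library(x, degree=5)

# Number of two-term combinations that are nearly collinear (assumed threshold).
n_pairs = count_correlated_pairs(theta, r2_threshold=0.95)

# 2-norm condition number of the full library, including the constant term.
cond_number = np.linalg.cond(np.column_stack([np.ones_like(x), theta]))
print(n_pairs, cond_number)
```

Even in this one-variable toy setting, several monomial pairs are close to collinear and the full library is badly conditioned, so small perturbations of the data can move the regression coefficients substantially.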

Paper Structure (12 sections, 3 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Numerical ill-conditioning of candidate function libraries for baseline biological dynamical systems. Panels (a)–(b) show representative time-series data generated from numerical simulations of the Lotka–Volterra predator–prey system and a chemical reaction network (CRN), respectively. Panels (c)–(d) report the condition numbers of full polynomial function libraries constructed using monomial, Legendre, and Chebyshev bases as functions of library degree for the two systems. Panels (e)–(f) quantify the prevalence of ill-conditioning by counting the number of ill-conditioned two-term and three-term combinations in the candidate libraries, measured as the base-10 logarithm of the number of combinations exceeding an $\mathrm{R}^2$ threshold. Results are shown separately for the Lotka–Volterra system (left column) and the CRN model (right column).
  • Figure 2: Conditioning of candidate function libraries across benchmark biological models as a function of model complexity. Panel (a) shows the dependence of the full library condition number on model complexity for each model. Panel (b) focuses on the conditioning of sub-matrices formed by the features corresponding to false negatives (missing terms) and false positives (wrong terms) in the misidentified model after sparse regression, with each point representing the mean across equations within a model and error bars indicating one standard deviation.
  • Figure 3: Distribution–basis alignment restores orthogonality, improves conditioning, and enables accurate sparse model recovery. Panels (a)–(d) compare the empirical distributions induced by system dynamics with the theoretical distributions associated with the weight functions required to preserve orthogonality for Legendre and Chebyshev bases. Solid curves represent empirical probability density functions constructed from simulated state trajectories generated by numerical simulation of the Lotka–Volterra (L-V) system (left column) and a CRN (right column), reflecting the data distributions induced by the underlying dynamics and serving as proxies for experimentally observed data. Histograms depict data resampled from these trajectories according to the corresponding idealized distributions implied by the orthogonality weight functions (uniform for Legendre and arcsine for Chebyshev). Panels (e)–(f) report the condition numbers of full candidate function libraries constructed from monomial and orthogonal polynomial bases when evaluated on original data versus distribution-aligned samples. Panels (g)–(h) show the resulting model identification errors quantified as the number of incorrectly recovered equations. Candidate function libraries in panels (e)–(h) are constructed using polynomial bases of degree 5.
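
The distribution–basis alignment described in the Figure 3 caption can be sketched in a few lines. This is an illustrative, single-variable example under stated assumptions, not the authors' resampling procedure: samples are drawn directly from the idealized distributions rather than resampled from simulated trajectories, and the "clustered" sample (uniform on a sub-interval of $[-1, 1]$) is a hypothetical stand-in for data confined by the system dynamics. The names `library_cond`, `samples`, and `results` are introduced here for illustration only.

```python
import numpy as np
from numpy.polynomial import chebyshev, legendre

DEGREE = 5   # the caption states that degree-5 libraries are used
N = 5000     # assumed sample size
rng = np.random.default_rng(1)

samples = {
    # Hypothetical stand-in for dynamics-induced data: states stuck in a sub-interval.
    "clustered": rng.uniform(-0.2, 0.6, size=N),
    # Uniform on [-1, 1]: matches the Legendre weight function w(x) = 1.
    "uniform": rng.uniform(-1.0, 1.0, size=N),
    # Arcsine on [-1, 1]: matches the Chebyshev weight 1 / sqrt(1 - x^2).
    "arcsine": np.cos(np.pi * rng.uniform(0.0, 1.0, size=N)),
}

def library_cond(vander, x):
    """2-norm condition number of the degree-DEGREE library evaluated on x."""
    return float(np.linalg.cond(vander(x, DEGREE)))

results = {
    "legendre": {
        "clustered": library_cond(legendre.legvander, samples["clustered"]),
        "aligned": library_cond(legendre.legvander, samples["uniform"]),
    },
    "chebyshev": {
        "clustered": library_cond(chebyshev.chebvander, samples["clustered"]),
        "aligned": library_cond(chebyshev.chebvander, samples["arcsine"]),
    },
}

for basis, conds in results.items():
    print(basis, conds)
```

When the samples follow the matching weight function, the library columns are approximately orthogonal and the condition number stays close to the ratio of the column norms. On the clustered sample the same orthogonal bases lose that property and their condition numbers grow by orders of magnitude.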