Sparse Weak-Form Discovery of Stochastic Generators

Eshwar R A; Gajanan V. Honnavar

Abstract

We propose a data-driven framework for discovering stochastic differential equations (SDEs) by applying the weak formulation to stochastic SINDy. The weak formulation makes the method robust to noise because it avoids computing derivatives from noisy data by finite differences. A further novelty is the use of spatial Gaussian test functions in place of temporal test functions: the kernel weight $K_j(X_{t_n})$ guarantees unbiasedness in expectation and prevents the structural regression bias that otherwise arises with temporal test functions. The framework converts the SDE identification problem into two SINDy-based linear sparse identification problems. We validate the algorithm on three benchmark SDEs, recovering all active non-linear terms with coefficient errors below 4%, stationary-density total-variation distances below 0.01, and autocorrelation functions that faithfully reproduce the true relaxation timescales on all three benchmarks.
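
The idea of weighting increments by a spatial kernel $K_j(X_{t_n})$ and solving two linear problems can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's algorithm: the row construction (kernel-weighted sums of $\Delta X$ for drift and of $\Delta X^2$ for diffusion), the polynomial library, the kernel centres, the step size `dt=0.01`, and the use of plain least squares in place of the sparse regression step. The drift-bias correction mentioned in the Figure 1 caption is not included. The OU parameters ($\theta=1$, $\sigma_0^2=0.49$) and the values $M=50$, $h=0.22$ follow the figure captions.

```python
import numpy as np

def simulate_ou(theta, sigma, dt, n_steps, n_traj, rng):
    """Euler-Maruyama paths of dX = -theta*X dt + sigma dW, shape (n_steps+1, n_traj)."""
    x = np.zeros((n_steps + 1, n_traj))
    kicks = sigma * np.sqrt(dt) * rng.standard_normal((n_steps, n_traj))
    for n in range(n_steps):
        x[n + 1] = x[n] - theta * x[n] * dt + kicks[n]
    return x

def library(x):
    """Hypothetical candidate library [1, x, x^2, x^3], shape (len(x), 4)."""
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

def weak_form_systems(paths, dt, centers, h):
    """Build one regression row per Gaussian kernel centre (assumed construction).

    Row j: A[j, k]    = dt * sum_n K_j(X_n) * theta_k(X_n)
           y_drift[j] =      sum_n K_j(X_n) * (X_{n+1} - X_n)
           y_diff[j]  =      sum_n K_j(X_n) * (X_{n+1} - X_n)**2
    The kernel is evaluated at the left endpoint X_n only.
    """
    x0 = paths[:-1].ravel()
    dx = np.diff(paths, axis=0).ravel()
    dx2 = dx**2
    theta_mat = library(x0)
    A = np.empty((centers.size, theta_mat.shape[1]))
    y_drift = np.empty(centers.size)
    y_diff = np.empty(centers.size)
    for j, c in enumerate(centers):
        k = np.exp(-0.5 * ((x0 - c) / h) ** 2)  # kernel weights K_j(X_n)
        A[j] = dt * (k @ theta_mat)
        y_drift[j] = k @ dx
        y_diff[j] = k @ dx2
    return A, y_drift, y_diff

rng = np.random.default_rng(0)
dt = 0.01  # assumption: coarser than the paper's step, to keep the demo fast
paths = simulate_ou(theta=1.0, sigma=0.7, dt=dt, n_steps=10_000, n_traj=100, rng=rng)
centers = np.linspace(-1.5, 1.5, 50)  # M = 50 centres; the range is an assumption
A, y_drift, y_diff = weak_form_systems(paths, dt, centers, h=0.22)

# Plain least squares stands in for the sparse regression step.
c_drift = np.linalg.lstsq(A, y_drift, rcond=None)[0]
c_diff = np.linalg.lstsq(A, y_diff, rcond=None)[0]
print("drift coefficients [1, x, x^2, x^3]:    ", c_drift)
print("diffusion coefficients [1, x, x^2, x^3]:", c_diff)
```

In this sketch the coefficient of $x$ in `c_drift` should land near $-\theta=-1$ and the constant term of `c_diff` near $\sigma_0^2=0.49$. A sparsity-promoting solver would additionally push the inactive library terms to zero.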

Paper Structure (55 sections, 4 theorems, 54 equations, 5 figures, 2 tables, 1 algorithm)

Introduction
Motivation
Stochastic Systems and the Identification Challenge
Prior Work and the Gap
This Work
Background
Itô Diffusions and the Infinitesimal Generator
Sparse Identification of Nonlinear Dynamics
Stochastic SINDy
Weak SINDy for Deterministic Systems
Methodology
Problem Formulation
Spatial Gaussian Test Functions
The Drift Identification System
Diffusion Identification via Quadratic Variation
...and 40 more sections

Key Result

Proposition 1

Under Assumptions ass:geo_erg--ass:regularity (geometric ergodicity through regularity), let $T\to\infty$ with $\Delta t=T/N$ held fixed. If $\bar{A}$ has full column rank, then the OLS estimator is strongly consistent: $\hat{c} \xrightarrow{a.s.} c^*$.
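
To see why the full-column-rank condition matters, it helps to write the OLS estimator explicitly. The notation in this sketch is generic and assumed, not taken from the source: $A$ is a finite-sample regression matrix (presumably the finite-$T$ counterpart of $\bar{A}$), $y$ is the target vector, and $\varepsilon$ is the residual at the true coefficients $c^*$.

```latex
\hat{c} \;=\; \arg\min_{c}\,\lVert A c - y \rVert_2^2 \;=\; (A^\top A)^{-1} A^\top y,
\qquad
y = A c^* + \varepsilon
\;\Longrightarrow\;
\hat{c} - c^* \;=\; (A^\top A)^{-1} A^\top \varepsilon .
```

Full column rank makes $A^\top A$ invertible, so the minimiser is unique. Consistency then amounts to the error term on the right vanishing as $T\to\infty$.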

Figures (5)

Figure 1: Recovered vs. true drift and diffusion functions for all three benchmark systems. Blue solid lines show the ground truth; red dashed lines show the weak SINDy estimates from Algorithm 1. Top row (drift functions): OU process, $b(x)=-\theta x$, mean relative error 2.0%; double-well Langevin system, $b(x)=x-x^3$, mean rel. err. 2.7%; multiplicative diffusion, $b(x)=-2x$, mean rel. err. 3.9%. Bottom row (diffusion functions): OU process, $a(x)=\sigma_0^2=0.490$, mean rel. err. 0.0%; double-well system, $a(x)=\sigma_0^2=0.250$, mean rel. err. 0.1%; multiplicative system, $a(x)=0.25(1+x^2)$, mean rel. err. 0.4% (after drift-bias correction). $M=50$ spatial Gaussian kernels with $h=0.22$ (OU and double-well) or $h=0.27$ (multiplicative). All recovered curves are visually indistinguishable from ground truth at the displayed scale.
Figure 2: LassoCV regularisation paths for all six sub-problems. Each panel plots the mean cross-validated MSE (averaged over five trajectory folds) as a function of the regularisation strength $\alpha$, shown with $\alpha$ decreasing from left to right. Red dashed vertical lines mark the selected $\alpha^*$. Top row (drift): OU ($\alpha^*\approx1.6\times10^{-3}$), double-well ($\approx4.9\times10^{-5}$), multiplicative ($\approx2.1\times10^{-4}$). Bottom row (diffusion): OU ($\approx4.5\times10^{-7}$), double-well ($\approx1.0\times10^{-8}$), multiplicative ($\approx1.1\times10^{-5}$). In every panel the sharp elbow separates the over-regularised regime (to the left of the selected $\alpha^*$, where active terms are forced to zero and the CV MSE rises sharply) from the under-regularised regime (to the right, where inactive terms are admitted and CV MSE increases due to over-fitting). The clean elbow structure confirms that grouped CV reliably identifies the correct sparsity level in all six cases.
Figure 3: Stationary density: true SDE vs. recovered model. Densities are computed analytically using the Fokker--Planck formula $\pi(x)\propto a(x)^{-1}\exp\!\bigl(2\int_0^x b(y)/a(y)\,dy\bigr)$. Blue solid lines show the true SDE density; red dashed lines show the recovered model density. The shaded region between the two curves quantifies the pointwise discrepancy. Left (OU): Gaussian stationary distribution reproduced with total variation $\mathrm{TV}=0.0050$. Centre (double-well): Bimodal distribution with peaks at $x\approx\pm1$ faithfully captured; $\mathrm{TV}=0.0092$. The small discrepancy in peak heights is consistent with the 2.9% error in the cubic drift coefficient. Right (multiplicative): Unimodal heavy-tailed distribution reproduced with $\mathrm{TV}=0.0093$, demonstrating the effectiveness of the bias correction for state-dependent diffusion. In all panels, the shaded discrepancy regions are visually negligible compared to the density scale.
Figure 4: Autocorrelation check: true SDE vs. recovered model. Empirical autocorrelation functions computed from 200,000-step simulations of both the true dynamics and the recovered model. Light blue solid lines show the true SDE; red dashed lines show the recovered model. Left (OU): Recovered relaxation rate $\hat{\theta}=0.980$ (err. 2.0%) closely matches the analytical $e^{-\tau}$ (black dotted). The recovered autocorrelation is nearly indistinguishable from both the true SDE and the analytic reference. Centre (double-well): True and recovered autocorrelations agree closely across both the fast intra-well relaxation ($\tau\lesssim0.2$) and the slower inter-well mixing regime. No closed-form analytic reference is available. Right (multiplicative): The recovered state-dependent diffusion faithfully reproduces the mixing rate of the true SDE across the full range of displayed lags.
Figure 5: Theoretical noise scaling: Weak Form vs Kramers--Moyal. All curves are purely analytical; no regression is performed. Left: KM noise magnitude $\sigma_{\rm obs}/\Delta t$ as a function of $\Delta t$ for three SNR levels. The noise diverges as $\Delta t\to0$. Centre: WF effective noise $\sigma_{\rm obs}/\sqrt{Nh_{\rm eff}}$ for the same SNR levels, where $N=T/\Delta t$ and $h_{\rm eff}=\sqrt{\pi/2}\,h$. The noise grows only as $\sqrt{\Delta t}$ and remains bounded as $\Delta t\to0$. Right: Ratio of KM noise to WF noise (SNR advantage), which grows as $\Delta t^{-3/2}$ (dotted reference line). At the experimental setting $\Delta t=0.002$ (vertical dotted line), the advantage exceeds $10^4$ for SNR$=$10. $T=100$, $h_{\rm eff}=\sqrt{\pi/2}\times0.22\approx0.276$.
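
The scaling stated in the Figure 5 caption can be checked with a few lines of arithmetic. The sketch below evaluates only the two expressions quoted in that caption, $\sigma_{\rm obs}/\Delta t$ and $\sigma_{\rm obs}/\sqrt{Nh_{\rm eff}}$ with $N=T/\Delta t$. The function names are hypothetical, and $\sigma_{\rm obs}$ is set to 1 as a placeholder because it cancels in the ratio; the mapping from SNR to $\sigma_{\rm obs}$ is not given in the caption and is not modelled here.

```python
import numpy as np

T = 100.0                        # total observation time, from the caption
h = 0.22                         # kernel bandwidth, from the caption
h_eff = np.sqrt(np.pi / 2) * h   # effective bandwidth, approx. 0.276

def km_noise(sigma_obs, dt):
    """Kramers-Moyal noise magnitude sigma_obs / dt (caption formula)."""
    return sigma_obs / dt

def wf_noise(sigma_obs, dt):
    """Weak-form effective noise sigma_obs / sqrt(N * h_eff) with N = T / dt."""
    n_samples = T / dt
    return sigma_obs / np.sqrt(n_samples * h_eff)

def advantage(dt, sigma_obs=1.0):
    """Ratio of KM noise to WF noise; sigma_obs cancels out."""
    return km_noise(sigma_obs, dt) / wf_noise(sigma_obs, dt)

# Log-log slope of the ratio against dt; the caption states -3/2.
dts = np.logspace(-4, -1, 31)
slope = np.polyfit(np.log(dts), np.log(advantage(dts)), 1)[0]

# Ratio at the experimental step size dt = 0.002.
ratio_at_experiment = advantage(0.002)

print(f"h_eff = {h_eff:.3f}")
print(f"log-log slope of KM/WF ratio = {slope:.3f}")
print(f"KM/WF ratio at dt = 0.002: {ratio_at_experiment:.3g}")
```

Algebraically the ratio reduces to $\sqrt{T h_{\rm eff}}\,\Delta t^{-3/2}$, which is why the fitted slope is $-3/2$ and why the value at $\Delta t=0.002$ does not depend on the placeholder $\sigma_{\rm obs}$.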

Theorems & Definitions (9)

Proposition 1: Strong consistency
proof
Corollary 2: Best $L^2(\mu)$ approximation
proof
Proposition 3: Asymptotic normality
proof
Remark 1
Proposition 4: Noise robustness
proof
