Efficient Neural SDE Training using Wiener-Space Cubature

Luke Snow, Vikram Krishnamurthy

TL;DR

This paper addresses scalable training of neural stochastic differential equations by replacing Monte Carlo gradient estimation with Wiener-space cubature, extending cubature theory to nonlinear path-functionals and enabling deterministic, GPU-friendly ODE evaluations. It develops a Stratonovich reformulation, constructs cubature paths and weights, and proves non-asymptotic error bounds for nonlinear loss functionals. A high-order recombination algorithm drastically reduces the number of required ODE solves, achieving effective $\mathcal{O}(n^{-1})$ convergence under suitable parameter choices and providing concrete pre-processing complexity. Numerical studies show faster convergence and substantial wall-clock and memory savings compared to SDE Monte Carlo, across varying dimensions and architectures. The framework offers a principled, efficient approach to neural SDE training with potential broader impact in stochastic modeling and high-dimensional inference.

Abstract

A neural stochastic differential equation (SDE) is an SDE with drift and diffusion terms parametrized by neural networks. The training procedure for neural SDEs consists of optimizing the SDE vector field (neural network) parameters to minimize the expected value of an objective functional on infinite-dimensional path-space. Existing training techniques focus on methods to efficiently compute path-wise gradients of the objective functional with respect to these parameters, then pair this with Monte-Carlo simulation to estimate the gradient expectation. In this work we introduce a novel training technique which bypasses and improves upon this Monte-Carlo simulation; we extend results in the theory of Wiener space cubature to approximate the expected objective functional value by a weighted sum of functional evaluations of deterministic ODE solutions. Our main mathematical contribution enabling this approximation is an extension of cubature bounds to the setting of Lipschitz-nonlinear functionals acting on path-space. Our resulting constructive algorithm allows for more computationally efficient training along several lines. First, it circumvents Brownian motion simulation and enables the use of efficient parallel ODE solvers, thus decreasing the complexity of path-functional evaluation. Furthermore, and more surprisingly, we show that the number of paths required to approximate the expected loss functional (oracle) value to a given accuracy can be reduced in this deterministic cubature regime. Specifically, we show that under reasonable regularity assumptions we can observe an $O(1/n)$ convergence rate, where $n$ is the number of path evaluations, in contrast with the standard $O(1/\sqrt{n})$ rate of naive Monte-Carlo or the $O(\log(n)^d/n)$ rate of quasi-Monte-Carlo.
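The "weighted sum of functional evaluations of deterministic ODE solutions" can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes the simplest well-known Wiener-space cubature formula (degree 3, one Brownian dimension, two linear paths with slopes $\pm 1/\sqrt{h}$ and weights $1/2$ on each sub-interval of length $h$), no recombination, and a loss that depends only on the terminal value. The test problem (geometric Brownian motion in Stratonovich form, chosen because its mean is known in closed form) and all function names are hypothetical.

```python
import itertools
import math


def rk4_step(f, t, x, dt):
    """One classical Runge-Kutta step for the scalar ODE dx/dt = f(t, x)."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, x + dt * k1 / 2)
    k3 = f(t + dt / 2, x + dt * k2 / 2)
    k4 = f(t + dt, x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def cubature_expectation(v0, v1, loss, x0, T, k, substeps=20):
    """Degree-3 cubature estimate of E[loss(X_T)] for the scalar Stratonovich
    SDE dX = v0(X) dt + v1(X) o dB, using 2**k piecewise-linear paths."""
    h = T / k                      # length of each sub-interval
    slope = 1.0 / math.sqrt(h)     # |d omega/dt|: each path moves by +-sqrt(h) per sub-interval
    weight = 0.5 ** k              # product of the per-interval weights 1/2
    dt = h / substeps
    total = 0.0
    for signs in itertools.product((1.0, -1.0), repeat=k):
        x, t = x0, 0.0
        for s in signs:
            # Deterministic ODE obtained by replacing dB with d omega = s * slope * dt.
            rhs = lambda _t, y, s=s: v0(y) + v1(y) * s * slope
            for _ in range(substeps):
                x = rk4_step(rhs, t, x, dt)
                t += dt
        total += weight * loss(x)
    return total


# Hypothetical test problem: dX = mu X dt + sigma X o dB, so X_T = x0 exp(mu T + sigma B_T).
mu, sigma, x0, T = 0.1, 0.2, 1.0, 1.0
estimate = cubature_expectation(lambda x: mu * x, lambda x: sigma * x,
                                lambda x: x, x0, T, k=8)
exact = x0 * math.exp(mu * T + 0.5 * sigma ** 2 * T)
print(f"cubature: {estimate:.6f}  exact: {exact:.6f}")
```

Every path is an ordinary ODE solve, so the $2^k$ evaluations are independent and could be batched on parallel hardware. The exponential growth in $k$ seen here is what the recombination step described in the paper is meant to avoid.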

Paper Structure

This paper contains 30 sections, 10 theorems, 93 equations, 5 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Denote by $C_{bv}^0([0,T],\mathbb{R}^{d})$ the space of $\mathbb{R}^{d}$-valued continuous functions of bounded variation defined on $[0,T]$. Let $(X_t^{\theta})_{t\in[0,T]}$ be the solution of an SDE in Stratonovich form (eq:strat), with vector fields $\{V_i^{\theta}\}_{i=0}^{d_b}$. There exist, for a such that for $t > 0$, Here $m$ is the cubature "order", and $\phi_j^{\theta}(t)$ is the solution

Figures (5)

  • Figure 1: Convergence of cubature (degree-$5$) vs. Monte Carlo estimates of $\mathcal{L}_{\text{data}}(X)$. Monte Carlo exhibits the standard $O(n^{-1/2})$ decay in path count $n$, while cubature achieves faster convergence consistent with the $O(n^{-1})$ rate of Corollary \ref{cor:noneap}. For equal $n$, cubature estimates are uniformly more accurate.
  • Figure 2: Convergence of the variational loss of a neural SDE vs. training iteration, in dimensions $8,32,64,128$, using both cubature and Monte Carlo (MC) estimators. We observe that the optimization over the standard loss functional \ref{eq:objective} using traditional Monte Carlo, and the optimization over the cubature loss functional \ref{eq:sum}, behave equivalently in each dimension when measured with respect to training iterations.
  • Figure 3: Convergence of the variational loss of a neural SDE during training, in dimensions $8,32,64,128$, using both cubature and Monte Carlo (MC) estimators. The cubature method evaluates an estimate of the expected loss functional, at each training epoch, using ODE solutions w.r.t. deterministic cubature paths. The MC method evaluates this estimate by standard stochastic approximation. It is observed that while both methods converge at similar rates w.r.t. the number of training epochs, the cubature estimator offers significant speedups (Table \ref{tab:timings}), and thus the convergence vs. wall-clock execution time is much faster in the cubature method.
  • Figure 4: We train an 8-dimensional neural SDE to match the path dataset $X_{\text{data}}$ displayed (via first marginal dimension) in blue. The orange paths represent (first marginal dimension) samples from the neural SDE distribution before, during, and after training via both cubature and standard MC methods. Observe that the data approximation is qualitatively comparable at each stage of the training process for each technique. The advantage of the cubature method is that the per-epoch compute time and memory requirements are made more efficient, as displayed in Figure \ref{fig:traintime}.
  • Figure 5: Degree-$5$ cubature paths for discretization levels $k=1,\dots,5$. The number of paths $n$ grows exponentially in $k$ without recombination, but Section \ref{sec:hor} shows how recombination yields polynomial growth. Panel (f) shows that the one-time preprocessing cost remains negligible ($0.8$s for $k=10, n = 4612$). This validates the feasibility of using cubature paths in training while retaining the improved $O(n^{-1})$ rate of Corollary \ref{cor:noneap}, which improves training efficiency.

Theorems & Definitions (16)

  • Theorem 1: lyons2004cubature
  • Definition 2: Cubature Path and Weights
  • Lemma 3: lyons2004cubature Theorem 2.4
  • Theorem 4: Cubature Path Approximation Error
  • Definition 5: Localization litterer2012high
  • Definition 6: Measure Reduction litterer2012high
  • Definition 7: Reduced Measure w.r.t. Localization litterer2012high
  • Definition 8: Reduced Measure Procedure (RMP)
  • Theorem 9: Recombined Cubature Path Approximation Error
  • Corollary 10: Improved Estimate Efficiency
  • ...and 6 more