Adjoint path-kernel method for backpropagation and data assimilation in unstable diffusions

Angxiu Ni

TL;DR

The paper develops an adjoint path-kernel method for computing parameter-gradients (linear responses) of discrete-time and continuous-time stochastic systems, including non-hyperbolic dynamics with parameter-controlled multiplicative noise. The main term of the formula is shared across many parameters, so the cost is almost independent of the number of parameters. This makes gradient-based optimization feasible in high dimensions and over long horizons, even where conventional backpropagation suffers from gradient explosion. The author demonstrates the method on Lorenz 96 with multiplicative noise and applies it to a difficult 4D-Var data assimilation problem with partial observations and unknown parameters in the dynamics, solved by stochastic gradient descent. The result is a practical tool for long-horizon learning and parameter inference in chaotic diffusion systems.

Abstract

We derive the adjoint path-kernel method for computing parameter-gradients (linear responses) of SDEs. Its cost is almost independent of the number of parameters, and it works for non-hyperbolic systems with parameter-controlled multiplicative noise. With this new formula, we extend the conventional backpropagation method to settings with gradient explosion, and demonstrate it on the 40-dimensional Lorenz 96 system. Moreover, we consider a difficult version of the 4D-Var data assimilation problem where (1) the deterministic part of the model is chaotic, (2) the loss is a single long-time functional accounting for discrepancies in both the observations and the dynamics, (3) some parameters in the dynamics are unknown, and (4) some coordinates of the states cannot be observed, and cannot be reasonably inferred from other coordinates within a short time. We model the correction term at each time-step separately as a parameterized function of the random state. With our new tool, we can run stochastic gradient descent to find the path and parameters that best match the low-dimensional observation data. We demonstrate this on the 10D Lorenz-96 system with 8D observations.
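To make the gradient-explosion setting concrete, the following is a minimal sketch of conventional tangent propagation along one noisy Lorenz 96 path. It is not the adjoint path-kernel method. The drift is the standard Lorenz 96 vector field; everything else is an assumption made for illustration: the multiplicative-noise form `noise_scale * x`, the Euler-Maruyama discretization, the step size, and the function names.

```python
import numpy as np


def lorenz96_drift(x, forcing=8.0):
    """Standard Lorenz 96 drift: (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing


def lorenz96_jacobian_vec(x, v):
    """Directional derivative of the drift at x along v (product rule)."""
    return ((np.roll(v, -1) - np.roll(v, 2)) * np.roll(x, 1)
            + (np.roll(x, -1) - np.roll(x, 2)) * np.roll(v, 1)
            - v)


def tangent_growth(dim=40, steps=2000, dt=0.005, noise_scale=0.1, seed=0):
    """Propagate a state and a tangent vector with Euler-Maruyama.

    Assumed (hypothetical) noise model: sigma(x) = noise_scale * x, applied
    componentwise, so the tangent equation inherits noise_scale * v * dW.
    Returns the final state and the log-norm of the tangent at every step.
    """
    rng = np.random.default_rng(seed)
    x = 8.0 + rng.standard_normal(dim)  # perturbed unstable equilibrium x = F
    v = np.zeros(dim)
    v[0] = 1.0                          # unit perturbation of coordinate 0
    log_norms = np.empty(steps)
    for n in range(steps):
        dw = np.sqrt(dt) * rng.standard_normal(dim)  # Brownian increment
        # Both updates use the state x at the start of the step.
        v = v + dt * lorenz96_jacobian_vec(x, v) + noise_scale * v * dw
        x = x + dt * lorenz96_drift(x) + noise_scale * x * dw
        log_norms[n] = np.log(np.linalg.norm(v))
    return x, log_norms


if __name__ == "__main__":
    _, log_norms = tangent_growth()
    print("log |v| at the final step:", log_norms[-1])
```

Because the deterministic part is chaotic, the tangent norm in this sketch tends to grow roughly exponentially with the horizon, which is the regime in which plain backpropagation becomes impractical and which the abstract targets.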

Paper Structure

This paper contains 19 sections, 2 theorems, 67 equations, 8 figures.

Key Result

theorem 1

Fix any $x_0$, $v_0$, and any $\alpha_n$ (called a 'schedule'), a scalar process adapted to $\mathcal{F}_n$ and independent of $\gamma$. Consider the random dynamical system, where $f^\gamma(\cdot)$ and $\sigma^\gamma(\cdot)$ depend on the parameter $\gamma$. Let $v_n$ be the solution of the tangent equation starting from $v_0$. Denote $\Phi^{avg}_N:=\mathbb{E}\left[\Phi(x_N)\right]$ ...

Figures (8)

  • Figure 1: Plot of $x^0_t, x^1_t$ from a typical orbit of length $T=2$ at $\gamma^0=8, \gamma^1=2$.
  • Figure 2: $\Phi^{avg}$ and $\delta \Phi^{avg}$ of the stationary measure. The dots are $\Phi^{avg}$, and the short lines are $\delta \Phi^{avg}$ computed by the adjoint path-kernel algorithm; they are computed from the same orbit of $T=1000$, $W=2$. Left: $\Phi^{avg}$ vs. $\gamma^0$, where each line is computed with a different $\gamma^1$. The black triangles are computed on the original Lorenz system without noise. Right: $\Phi^{avg}$ vs. $\gamma^1$, each line has a different $\gamma^0$.
  • Figure 3: Gradients and the contour of $\rho(\Phi)$. The arrow is $1/10$ of the gradient.
  • Figure 4: Lorenz 63 example. Comparison of the 2D data (red line) and the observation generated by the 3D model (black line, averaged over $L=10$ samples) at different stages of the optimization. From left to right: initial guess, after 21 updates (or 'epoch'), and after 291 updates.
  • Figure 5: $\Phi^{avg}$ vs. number of updates for $T=2$ and different sample size $L$.
  • ...and 3 more figures

Theorems & Definitions (5)

  • theorem 1: tangent discrete-time path-kernel
  • theorem 4: adjoint discrete-time path-kernel
  • proof
  • proof : Derivation of \ref{t:adjSDE}
  • proof : Derivation