A note on the adjoint method for neural ordinary differential equation network

Pipi Hu

TL;DR

This note analyzes the adjoint method for neural ODE networks with a focus on rigorous justification via calculus of variations. It demonstrates that gradients produced by the adjoint framework pertain to the Lagrange cost $J(\theta)$ rather than the terminal loss $L(z(t_f))$, and shows that a time-invariant control $\theta$ simplifies the derivative structure while enabling a link between the adjoint $a(t)$ and the loss gradient $\partial L/\partial z$. The author provides a rigorous continuous-time proof and a simple but nonrigorous alternative, extends the framework to multi-label losses with both single- and multi-point backpropagation, and presents an operator-adjoint perspective that unifies continuous and discrete formulations. The note also emphasizes that discrete adjoint schemes must align with the forward discretization to recover backpropagation results, and discusses implications for stability and implementation. Overall, the work clarifies the conditions under which adjoint methods reproduce backpropagation results and highlights how discrete formulation choices affect gradient accuracy and training dynamics.

Abstract

Perturbation and operator adjoint methods are used to derive the correct adjoint form rigorously. The derivation yields the following results: 1) the loss gradient is not an ODE but an integral, and we show the reason; 2) the traditional adjoint form is not equivalent to the backpropagation results; 3) the adjoint operator analysis shows that the adjoint form gives the same results as backpropagation (BP) if and only if the discrete adjoint uses the same scheme as the discrete neural ODE.
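Result 3) can be illustrated numerically. The following is a minimal sketch, not the note's own construction: it assumes a scalar ODE $\dot z = \tanh(\theta z)$, a forward Euler discretization, and a terminal loss $\tfrac12 (z_N - y)^2$, all chosen here purely for illustration. The variable `lam` stands for $\partial L/\partial z_k$, which may differ in sign from the note's $a(t)$. The "matched" adjoint is the exact transpose of the forward Euler steps (that is, BP), while the "mismatched" adjoint applies explicit Euler to the continuous adjoint ODE in reverse time, evaluating Jacobians at $z_{k+1}$ instead of $z_k$.

```python
import math

# Hypothetical scalar vector field f(z, theta) = tanh(theta * z) and its partials.
def f(z, theta):
    return math.tanh(theta * z)

def df_dz(z, theta):
    return theta * (1.0 - math.tanh(theta * z) ** 2)

def df_dtheta(z, theta):
    return z * (1.0 - math.tanh(theta * z) ** 2)

def forward_euler(z0, theta, h, n):
    """Discrete neural ODE: z_{k+1} = z_k + h * f(z_k, theta)."""
    zs = [z0]
    for _ in range(n):
        zs.append(zs[-1] + h * f(zs[-1], theta))
    return zs

def loss(z0, theta, h, n, y):
    """Terminal loss 0.5 * (z_N - y)^2 of the discrete trajectory."""
    return 0.5 * (forward_euler(z0, theta, h, n)[-1] - y) ** 2

def matched_adjoint_grad(z0, theta, h, n, y):
    """Adjoint with the same scheme as the forward pass (equals BP)."""
    zs = forward_euler(z0, theta, h, n)
    lam = zs[-1] - y  # dL/dz_N
    grad = 0.0
    for k in range(n - 1, -1, -1):
        grad += lam * h * df_dtheta(zs[k], theta)    # lam holds dL/dz_{k+1}
        lam = lam * (1.0 + h * df_dz(zs[k], theta))  # now dL/dz_k
    return grad

def mismatched_adjoint_grad(z0, theta, h, n, y):
    """Explicit Euler on the continuous adjoint ODE, Jacobians taken at z_{k+1}."""
    zs = forward_euler(z0, theta, h, n)
    lam = zs[-1] - y
    grad = 0.0
    for k in range(n - 1, -1, -1):
        grad += lam * h * df_dtheta(zs[k + 1], theta)
        lam = lam * (1.0 + h * df_dz(zs[k + 1], theta))
    return grad

def finite_difference_grad(z0, theta, h, n, y, eps=1e-6):
    """Central difference of the discrete loss, used as a reference."""
    return (loss(z0, theta + eps, h, n, y) - loss(z0, theta - eps, h, n, y)) / (2 * eps)

if __name__ == "__main__":
    args = (1.0, 0.5, 0.1, 10, 0.0)  # z0, theta, h, n, y (illustrative values)
    print("finite difference :", finite_difference_grad(*args))
    print("matched adjoint   :", matched_adjoint_grad(*args))
    print("mismatched adjoint:", mismatched_adjoint_grad(*args))
```

In this sketch the matched adjoint agrees with the finite-difference reference up to rounding, whereas the mismatched one differs by a term of order $h$. That gap shrinks as the step size decreases but does not vanish for any fixed discretization.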

Paper Structure (13 sections, 3 theorems, 79 equations, 3 figures)

Key Result

Proposition 1

Adjoint method: The adjoint variable $a(t)$ satisfies a backpropagation ODE, and the gradient of the Lagrange cost $J(\theta) = L(z(t_f)) + \int_{t_0}^{t_f} a(t) \bigl(\dot{z}(t)-f(t,z(t),\theta(t))\bigr)\,dt$ has the form

Figures (3)

  • Figure 1: The perturbation schemes. (a) The perturbation of $\theta$. (b) The perturbation of $z$ induced by the perturbation of $\theta$.
  • Figure 2: The perturbation schemes. (a) The perturbation of $\theta$. (b) The perturbation of $z$ induced by the perturbation of $\theta$, with the times of the multiple labels marked. $t_i$ denotes the time of the $i^{th}$ label on the timeline, with corresponding trajectory value $z(t_i)$.
  • Figure 3: The schemes' differences.

Theorems & Definitions (6)

  • Proposition 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Remark 1