Differentiable Programming for Differential Equations: A Review

Facundo Sapienza; Jordi Bolibar; Frank Schäfer; Brian Groenke; Avik Pal; Victor Boussange; Patrick Heimbach; Giles Hooker; Fernando Pérez; Per-Olof Persson; Christopher Rackauckas

Differentiable Programming for Differential Equations: A Review

Facundo Sapienza, Jordi Bolibar, Frank Schäfer, Brian Groenke, Avik Pal, Victor Boussange, Patrick Heimbach, Giles Hooker, Fernando Pérez, Per-Olof Persson, Christopher Rackauckas

TL;DR

This paper surveys the spectrum of methods for differentiable programming applied to differential-equation models, clarifying the distinctions between continuous versus discrete and forward versus reverse approaches. It provides a cohesive mathematical framework, compares the trade-offs in accuracy, memory, and computational cost, and discusses practical implementation considerations across direct and solver-based methods. The authors offer structured recommendations for practitioners and identify generalizations to higher-order ODEs, PDEs, chaotic systems, and SDEs, with an emphasis on scalability and reproducibility. The work aims to accelerate the integration of physics-based models with data-driven techniques through robust, efficient gradient computation in differentiable DE-based workflows.

Abstract

The differentiable programming paradigm is a cornerstone of modern scientific computing. It refers to numerical methods for computing the gradient of a numerical model's output. Many scientific models are based on differential equations, where differentiable programming plays a crucial role in calculating model sensitivities, inverting model parameters, and training hybrid models that combine differential equations with data-driven approaches. Furthermore, recognizing the strong synergies between inverse methods and machine learning offers the opportunity to establish a coherent framework applicable to both fields. Differentiating functions based on the numerical solution of differential equations is non-trivial. Numerous methods based on a wide variety of paradigms have been proposed in the literature, each with pros and cons specific to the type of problem investigated. Here, we provide a comprehensive review of existing techniques to compute derivatives of numerical solutions of differential equations. We first discuss the importance of gradients of solutions of differential equations in a variety of scientific domains. Second, we lay out the mathematical foundations of the various approaches and compare them with each other. Third, we cover the computational considerations and explore the solutions available in modern scientific software. Last but not least, we provide best-practices and recommendations for practitioners. We hope that this work accelerates the fusion of scientific models and data, and fosters a modern approach to scientific modelling.

Differentiable Programming for Differential Equations: A Review

TL;DR

Abstract

Paper Structure (64 sections, 81 equations, 7 figures, 1 table)

This paper contains 64 sections, 81 equations, 7 figures, 1 table.

Introduction
Scientific motivation: A domain science perspective
Machine learning
Computational physics and optimal design
Computational fluid dynamics
Quantum physics
Other applications
Geosciences
Numerical weather prediction
Oceanography
Climate science
Glaciology
Biology and ecology
Computational finance
Methods: A mathematical perspective
...and 49 more sections

Figures (7)

Figure 1: Schematic representation of the different methods available for differentiation involving differential equation solutions. These can be classified depending on whether they find the gradient by solving a new system of differential equations (continuous) or whether they manipulate unit algebraic operations (discrete). Additionally, these methods can be categorized depending on whether sensitivities are propagated from input to output (forward) or from output to input (reverse). There exist multiple intermediate methods to perform differentiation between the different axes of this figure. For example, elimination techniques can be used to efficiently evaluate derivatives (see Section \ref{['sec:ad-further']}) and continuous methods may rely on discrete methods.
Figure 2: Comparison between forward and reverse AD. Changing the order of Jacobian and vector multiplications changes the total number of floating-point operations, which leads to different computational complexities between forward and reverse mode. When computing directional derivatives with forward AD, there is a total of $\mathcal{O} (k)$ JVPs that need to be computed, which considering we need to repeat this procedure $p$ times gives a total complexity of $\mathcal{O} (kp)$. This is the opposite of what happens when we carry the VJPs from the left side of the expression, where the matrix of size $d_1 \times p$ has no effect in the intermediate calculations, making all the intermediate calculations $\mathcal{O} (1)$ with respect to $p$ and a total complexity of $\mathcal{O} (k + p)$.
Figure 3: Diagram showing how gradients are computed using discrete adjoints. On the left, we see how gradients will be computed if we use the chain rule applied to the directed triangle defined by the variables $\theta$, $U$, and $L$ (blue arrows). However, we can define the intermediate vector variable $G = G(U; \theta)$, which satisfies $G = 0$ as long as the discrete system of differential equations are satisfied, and apply the chain rule instead to the triangle defined by $\theta$, $G$, and $L$ (red arrows). In the red diagram, the calculation of $\frac{\partial L}{\partial G}$ is done by pivoting in $U$ as shown in the right diagram (shaded area). Notice that the use of adjoints avoids the calculation of the sensitivity $\frac{\partial U}{\partial \theta}$. The adjoint is defined as the partial derivative $\lambda^T = - \frac{\partial L}{\partial G}$ representing changes in the loss function due to variations in the discrete equation $G(U; \theta) = 0$.
Figure 4: Comparison between AD implemented with dual numbers and complex step differentiation. For the simple case of the function $f(x) = \sin(x^2)$, we can see how each operation is carried in the forward step by the dual component (blue) and the complex component (red). Whereas AD gives the exact gradient at the end of the forward run, in the case of the complex step method we need to take the limit in the imaginary part.
Figure 5: Computational graph associated to the discrete adjoint method. Reverse AD applied on top of the computational graph leads to the update rules for the discrete adjoint. The adjoint variable $\lambda_i$ in the discrete adjoint method coincides with the adjoint variable $\bar{g}_i$ defined in the backpropagation step.
...and 2 more figures

Differentiable Programming for Differential Equations: A Review

TL;DR

Abstract

Differentiable Programming for Differential Equations: A Review

Authors

TL;DR

Abstract

Table of Contents

Figures (7)