The Elements of Differentiable Programming

Mathieu Blondel, Vincent Roulet

TL;DR

The Elements of Differentiable Programming presents a rigorous, math-first account of differentiable programming, unifying automatic differentiation, optimization, and probabilistic learning. It builds from foundational calculus and differential geometry to practical representations of parameterized programs as computation graphs, including directed acyclic graphs (DAGs), and details how to differentiate through constructs such as control flow and data structures. Two core threads are the framework of Jacobian-vector products (JVPs) and vector-Jacobian products (VJPs), and the role of the exponential family in probabilistic learning, which together support end-to-end differentiable models with uncertainty quantification. The book also explains how to design differentiable operations, including smoothing and optimization techniques, extending differentiable programming beyond deep learning to reinforcement learning and scientific computing.
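To make the JVP/VJP pairing concrete, here is a minimal sketch in plain Python. It is not taken from the book: the function `f`, its hand-derived `jacobian`, and the helper names `jvp`, `vjp`, and `dot` are all hypothetical choices for illustration. A JVP pushes an input direction $v$ forward through the Jacobian $J$, a VJP pulls an output direction $u$ back through $J^\top$, and the two are tied together by the identity $\langle u, Jv \rangle = \langle J^\top u, v \rangle$.

```python
import math

def f(w):
    """Example function f: R^2 -> R^2 (hypothetical, chosen for illustration)."""
    w1, w2 = w
    return [w1 * w2, math.sin(w1)]

def jacobian(w):
    """Hand-derived Jacobian of f at w, one row per output coordinate."""
    w1, w2 = w
    return [[w2, w1],
            [math.cos(w1), 0.0]]

def jvp(w, v):
    """Jacobian-vector product J(w) v: pushes an input direction v forward."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in jacobian(w)]

def vjp(w, u):
    """Vector-Jacobian product J(w)^T u: pulls an output direction u back."""
    J = jacobian(w)
    return [sum(J[i][j] * u[i] for i in range(len(u))) for j in range(len(J[0]))]

def dot(a, b):
    """Euclidean inner product of two lists of floats."""
    return sum(x * y for x, y in zip(a, b))

w = [0.5, 2.0]   # point at which we differentiate
v = [1.0, -1.0]  # input direction
u = [3.0, 0.25]  # output direction

# Adjoint identity: <u, J v> equals <J^T u, v>.
lhs = dot(u, jvp(w, v))
rhs = dot(vjp(w, u), v)

# Finite-difference check of the JVP: (f(w + eps v) - f(w)) / eps.
eps = 1e-6
w_shifted = [wi + eps * vi for wi, vi in zip(w, v)]
fd = [(a - b) / eps for a, b in zip(f(w_shifted), f(w))]
```

This sketch builds the full Jacobian explicitly, which is only practical for tiny examples. Autodiff systems evaluate the same products without ever materializing $J$: forward mode computes JVPs and reverse mode computes VJPs.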

Abstract

Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.

Paper Structure

This paper contains 432 sections, 73 theorems, 1175 equations, 83 figures, 13 tables, 23 algorithms.

Introduction
  What is differentiable programming?
  Book goals and scope
  Intended audience
  How to read this book?
  Related work
Fundamentals
  Differentiation
    Univariate functions
      Derivatives
      Calculus rules
      Leibniz's notation
    Multivariate functions
      Directional derivatives
      Gradients
...and 417 more sections

Key Result

Proposition 2.1
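
Proposition 2.1 is listed under Theorems & Definitions as "Differentiability implies continuity". A standard one-line argument, given here as a sketch and not quoted from the book, uses the difference quotient from the definition of the derivative. If $f$ is differentiable at $w_0$, then

$$\lim_{\delta \to 0} \big(f(w_0 + \delta) - f(w_0)\big) = \lim_{\delta \to 0} \delta \cdot \frac{f(w_0 + \delta) - f(w_0)}{\delta} = 0 \cdot f'(w_0) = 0,$$

so $f(w_0 + \delta) \to f(w_0)$ as $\delta \to 0$, which is continuity of $f$ at $w_0$.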

Figures (83)

Figure 1: Neural networks can be seen as parameterized programs.
Figure 2: Thanks to automatic differentiation (autodiff), the user can focus on expressing the forward computation (model), enabling fast experimentation and alleviating the need for error-prone manual gradient derivation.
Figure 3: A function $f$ can be locally approximated around a point $w_0$ by a secant, a linear function $w \mapsto aw + b$ with slope $a$ and intercept $b$, crossing $f$ at $w_0$ with value $u_0 = f(w_0)$ and at $w_0 + \delta$ with value $u_\delta = f(w_0 + \delta)$. Using $u_0 = a w_0 + b$ and $u_\delta = a(w_0 + \delta) + b$, we find that its slope is $a = (f(w_0 + \delta) - f(w_0)) / \delta$ and its intercept is $b = f(w_0) - a w_0$. The derivative $f'(w_0)$ of a function $f$ at a point $w_0$ is then defined as the limit of the slope $a$ when $\delta \to 0$. It is the slope of the tangent of $f$ at $w_0$. The value $f(w)$ of the function at $w$ can then be locally approximated around $w_0$ by $w \mapsto f'(w_0) w + f(w_0) - f'(w_0) w_0 = f(w_0) + f'(w_0)(w - w_0)$.
Figure 4: Illustration of discontinuity and non-differentiability. Left. A discontinuous function presents a jump in function values at a given point. Center. A function that is continuous but not differentiable everywhere presents kinks at the points of non-differentiability. Right. A function that is differentiable everywhere is smooth.
Figure 5: The gradient of a function $f: \mathbb{R}^2\rightarrow \mathbb{R}$ at $(w_1, w_2)$ is the normal vector to the tangent space of the level set $L_{f(w_1, w_2)} = \{(w_1', w_2'): f(w_1', w_2') = f(w_1, w_2)\}$ and points in the direction of higher function values.
...and 78 more figures
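
The construction in the Figure 3 caption can be checked numerically. The sketch below is illustrative only and not from the book: the choice $f = \sin$, the point $w_0 = 1$, and the helper names `secant_slope` and `linear_approximation` are assumptions. It computes the secant slope $a = (f(w_0 + \delta) - f(w_0)) / \delta$ for shrinking $\delta$ and compares it with the known derivative $\cos(w_0)$, then evaluates the tangent-line approximation $f(w_0) + f'(w_0)(w - w_0)$ near $w_0$.

```python
import math

def secant_slope(f, w0, delta):
    """Slope a of the secant through (w0, f(w0)) and (w0 + delta, f(w0 + delta))."""
    return (f(w0 + delta) - f(w0)) / delta

def linear_approximation(f, df_w0, w0):
    """Return the map w -> f(w0) + f'(w0) (w - w0), the tangent line at w0."""
    f_w0 = f(w0)
    return lambda w: f_w0 + df_w0 * (w - w0)

f = math.sin          # example function (assumption, for illustration)
w0 = 1.0              # point of approximation
exact = math.cos(w0)  # known derivative of sin at w0

# Secant slopes for decreasing delta approach the derivative.
deltas = [1e-1, 1e-2, 1e-3, 1e-4]
slopes = [secant_slope(f, w0, d) for d in deltas]
errors = [abs(a - exact) for a in slopes]

# The tangent line is accurate close to w0.
tangent = linear_approximation(f, exact, w0)
approx_error = abs(tangent(1.01) - f(1.01))

for d, a, e in zip(deltas, slopes, errors):
    print(f"delta={d:g}  slope={a:.6f}  error={e:.2e}")
```

The error of the secant slope shrinks roughly in proportion to $\delta$ until floating-point round-off takes over for very small $\delta$. Avoiding that trade-off is one reason automatic differentiation is preferred over finite differences.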

Theorems & Definitions (206)

Definition 1.1: Differentiable programming
Definition 2.1: Limit
Definition 2.2: Continuous function
Remark 2.1: Little $o$ notation
Definition 2.3: Derivative
Proposition 2.1: Differentiability implies continuity
Example 2.1: Derivative of power function
Remark 2.2: Functions on a subset $\mathcal{U}$ of $\mathbb{R}$
Example 2.2: Applying rules of differentiation
Definition 2.4: Directional derivative
...and 196 more
