The Elements of Differentiable Programming

Mathieu Blondel, Vincent Roulet

TL;DR

The Elements of Differentiable Programming presents a rigorous, math-first account of differentiable programming, unifying automatic differentiation, optimization, and probabilistic learning. It builds from foundational calculus and differential geometry to practical representations of parameterized programs as computation graphs and DAGs, and details how to differentiate through complex constructs like control flow and data structures. A core thread is the JVP/VJP framework and the role of the exponential family in probabilistic learning, enabling end-to-end differentiable models with rich uncertainty quantification. The book also clarifies how to design differentiable operations, including smoothing and optimization techniques, to extend differentiable programming beyond deep learning to reinforcement learning and scientific computing.

Abstract

Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.

Paper Structure

This paper contains 432 sections, 73 theorems, 1175 equations, 83 figures, 13 tables, 23 algorithms.

Key Result

Proposition 2.1: Differentiability implies continuity

Figures (83)

  • Figure 1: Neural networks can be seen as parameterized programs.
  • Figure 2: Thanks to automatic differentiation (autodiff), the user can focus on expressing the forward computation (model), enabling fast experimentation and alleviating the need for error-prone manual gradient derivation.
  • Figure 3: A function $f$ can be locally approximated around a point $w_0$ by a secant, a linear function $w \mapsto aw + b$ with slope $a$ and intercept $b$, crossing $f$ at $w_0$ with value $u_0 = f(w_0)$ and at $w_0 + \delta$ with value $u_\delta = f(w_0+\delta)$. Using $u_0 = a w_0 + b$ and $u_\delta = a(w_0 + \delta) + b$, we find that its slope is $a = (f(w_0 + \delta) - f(w_0)) / \delta$ and its intercept is $b = f(w_0) - a w_0$. The derivative $f'(w_0)$ of $f$ at $w_0$ is then defined as the limit of the slope $a$ as $\delta \to 0$. It is the slope of the tangent of $f$ at $w_0$. The value $f(w)$ can then be approximated for $w$ near $w_0$ by $w \mapsto f'(w_0) w + f(w_0) - f'(w_0) w_0 = f(w_0) + f'(w_0)(w - w_0)$.
  • Figure 4: Illustration of discontinuity and non-differentiability. Left. A discontinuous function presents a jump in function values at a given point. Center. A continuous function that is not differentiable everywhere presents kinks at the points of non-differentiability. Right. A function that is differentiable everywhere is smooth.
  • Figure 5: The gradient of a function $f: \mathbb{R}^2\rightarrow \mathbb{R}$ at $(w_1, w_2)$ is the normal vector to the tangent space of the level set $L_{f(w_1, w_2)} = \{(w_1', w_2'): f(w_1', w_2') = f(w_1, w_2)\}$ and points in the direction of higher function values.
  • ...and 78 more figures
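
The secant construction in the Figure 3 caption can be checked numerically. The following is a minimal sketch, not code from the book: the helper names `secant_slope` and `tangent_approx` are hypothetical, and the choice of $f = \sin$ (with known derivative $\cos$) and $w_0 = 1$ is an assumption made purely for illustration. It computes the secant slope $a = (f(w_0 + \delta) - f(w_0)) / \delta$ for shrinking $\delta$ and builds the local approximation $w \mapsto f(w_0) + f'(w_0)(w - w_0)$.

```python
import math


def secant_slope(f, w0, delta):
    """Slope a of the secant through (w0, f(w0)) and (w0 + delta, f(w0 + delta))."""
    return (f(w0 + delta) - f(w0)) / delta


def tangent_approx(f, df, w0):
    """Return the map w -> f(w0) + f'(w0) * (w - w0), given the derivative df."""
    value, slope = f(w0), df(w0)
    return lambda w: value + slope * (w - w0)


# Illustrative choices (assumptions, not from the source): f = sin, f' = cos, w0 = 1.
f, df, w0 = math.sin, math.cos, 1.0

# As delta shrinks, the secant slope approaches the derivative f'(w0).
errors = []
for delta in (1e-1, 1e-2, 1e-3):
    a = secant_slope(f, w0, delta)
    errors.append(abs(a - df(w0)))
    print(f"delta={delta:g}  secant slope={a:.6f}  |a - f'(w0)|={errors[-1]:.2e}")

# Near w0, the tangent line tracks f closely.
approx = tangent_approx(f, df, w0)
print(f"f(1.01)={f(1.01):.6f}  tangent approximation={approx(1.01):.6f}")
```

Here the derivative is supplied by hand; an autodiff system of the kind described in Figure 2 would produce `df` from `f` automatically, which is the point of the paradigm.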

Theorems & Definitions (206)

  • Definition 1.1: Differentiable programming
  • Definition 2.1: Limit
  • Definition 2.2: Continuous function
  • Remark 2.1: Little $o$ notation
  • Definition 2.3: Derivative
  • Proposition 2.1: Differentiability implies continuity
  • Example 2.1: Derivative of power function
  • Remark 2.2: Functions on a subset $\mathcal{U}$ of $\mathbb{R}$
  • Example 2.2: Applying rules of differentiation
  • Definition 2.4: Directional derivative
  • ...and 196 more