Gradients of Functions of Large Matrices

Nicholas Krämer; Pablo Moreno-Muñoz; Hrittik Roy; Søren Hauberg

TL;DR

The present work is the first to explain how to efficiently differentiate Lanczos and Arnoldi iterations, the workhorses of numerical linear algebra for evaluating functions of large matrices. It derives previously unknown adjoint systems for both iterations and shows that the resulting code competes with Diffrax when differentiating PDEs and with GPyTorch when selecting Gaussian process models, and beats standard factorisation methods for calibrating Bayesian neural networks.

Abstract

Tuning scientific and probabilistic machine learning models (for example, partial differential equations, Gaussian processes, or Bayesian neural networks) often relies on evaluating functions of matrices whose size grows with the data set or the number of parameters. While the state of the art for evaluating these quantities is almost always based on Lanczos and Arnoldi iterations, the present work is the first to explain how to differentiate these workhorses of numerical linear algebra efficiently. To get there, we derive previously unknown adjoint systems for Lanczos and Arnoldi iterations, implement them in JAX, and show that the resulting code competes with Diffrax when differentiating PDEs and with GPyTorch when selecting Gaussian process models, and beats standard factorisation methods for calibrating Bayesian neural networks. All this is achieved without any problem-specific code optimisation. Find the code at https://github.com/pnkraemer/experiments-lanczos-adjoints and install the library with pip install matfree.
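To make the setting concrete, the following is a minimal JAX sketch of a matrix-free Lanczos approximation of $f(A)v$ for a symmetric matrix, differentiated with respect to a parameter. It is an illustration under stated assumptions, not the adjoint method of the paper and not the matfree API: the names `lanczos` and `funm_vector_product`, the test matrix `B`, and the parametrisation $A(\theta) = \theta B$ are hypothetical, and the gradient is obtained by plain backpropagation through the loop, which is the baseline that the adjoint systems are meant to replace.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def lanczos(matvec, v, num_steps):
    """Lanczos with full reorthogonalisation (assumes num_steps >= 2).

    Returns Q (N x K, orthonormal columns) and a tridiagonal T (K x K).
    Only matrix-vector products with A are needed, never A itself.
    """
    Q = jnp.zeros((v.shape[0], num_steps)).at[:, 0].set(v / jnp.linalg.norm(v))
    alphas, betas = [], []
    for k in range(num_steps):
        w = matvec(Q[:, k])
        alphas.append(Q[:, k] @ w)
        # Orthogonalise against all columns built so far (unset columns are zero).
        w = w - Q @ (Q.T @ w)
        if k + 1 < num_steps:
            beta = jnp.linalg.norm(w)
            betas.append(beta)
            Q = Q.at[:, k + 1].set(w / beta)
    off_diagonal = jnp.stack(betas)
    T = jnp.diag(jnp.stack(alphas)) + jnp.diag(off_diagonal, 1) + jnp.diag(off_diagonal, -1)
    return Q, T


def funm_vector_product(matfun, matvec, v, num_steps):
    """Approximate f(A) v by ||v|| * Q f(T) e_1, with f(T) from an eigendecomposition."""
    Q, T = lanczos(matvec, v, num_steps)
    eigvals, eigvecs = jnp.linalg.eigh(T)
    fT_e1 = eigvecs @ (matfun(eigvals) * eigvecs[0, :])
    return jnp.linalg.norm(v) * (Q @ fT_e1)


# Hypothetical test problem: A(theta) = theta * B with B symmetric positive semi-definite.
key_B, key_v = jax.random.split(jax.random.PRNGKey(0))
M = jax.random.normal(key_B, (30, 30))
B = M @ M.T / 30.0
v = jax.random.normal(key_v, (30,))


def loss(theta):
    """v^T exp(-theta * B) v, evaluated with 12 Lanczos steps."""
    matvec = lambda x: theta * (B @ x)
    return v @ funm_vector_product(lambda s: jnp.exp(-s), matvec, v, num_steps=12)


theta = 0.5
value, grad = jax.value_and_grad(loss)(theta)
```

Backpropagating through the Python loop stores every intermediate of the iteration, so the cost of this sketch grows with the number of Lanczos steps; it is meant only to show what "gradients of functions of large matrices" refers to.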

Paper Structure (46 sections, 3 theorems, 68 equations, 12 figures, 7 tables, 4 algorithms)

Introduction
Contributions
Related work
Problem statement
Limitations and future work
The method: Adjoints of the Lanczos and Arnoldi iterations
Notation
Implicit differentiation
Adjoint system of the Arnoldi and Lanczos iterations
Matrix-free implementation
Solving the adjoint systems
Reorthogonalisation
Summary (before the case studies)
Case study: Exact Gaussian processes
Setup: Like GPyTorch's defaults
...and 31 more sections

Key Result

Theorem 4.1

Let $K \in \mathbb{N}$, $v \in \mathbb{R}^N$, $A \in \mathbb{R}^{N \times N}$, and a loss $\rho(\cdot) \in \mathbb{R}$ be given. If $Q \in \mathbb{R}^{N \times K}$, $H \in \mathbb{R}^{K \times K}$, $r \in \mathbb{R}^N$, and $c \in \mathbb{R}$ solve the forward constraint and if $\lambda \in \mathbb{R}^N$, $\Lambda \in \mathbb{R}^{N \times K}$, $\gamma \in \mathbb{R}^{K}$, $\Gamma$ ...
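The forward constraint itself is not shown in this excerpt. As a hedged illustration of what quantities with the shapes of $Q$, $H$, $r$, and $c$ can look like, the sketch below runs a plain Arnoldi iteration in JAX and checks the standard Arnoldi relation $AQ = QH + r e_K^\top$ together with $Q^\top Q = I$. The function name `arnoldi`, the normalisation $c = 1/\|v\|$ with $Q e_1 = c v$, and the random test matrix are assumptions made for this example and need not match the conventions of Theorem 4.1.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def arnoldi(matvec, v, num_steps):
    """Arnoldi iteration with classical Gram-Schmidt orthogonalisation.

    Returns Q (N x K), upper Hessenberg H (K x K), residual r (N,), and
    the scalar c = 1 / ||v|| (assumed convention: Q e_1 = c v).
    """
    c = 1.0 / jnp.linalg.norm(v)
    Q = jnp.zeros((v.shape[0], num_steps)).at[:, 0].set(c * v)
    H = jnp.zeros((num_steps, num_steps))
    for k in range(num_steps):
        w = matvec(Q[:, k])
        # Projection coefficients; entries beyond k vanish because those columns are still zero.
        h = Q.T @ w
        w = w - Q @ h
        H = H.at[:, k].set(h)
        if k + 1 < num_steps:
            beta = jnp.linalg.norm(w)
            H = H.at[k + 1, k].set(beta)
            Q = Q.at[:, k + 1].set(w / beta)
    r = w  # what remains of A q_K after orthogonalisation
    return Q, H, r, c


# Hypothetical test problem: a dense nonsymmetric matrix, accessed only through matvec.
key_A, key_v = jax.random.split(jax.random.PRNGKey(1))
A = jax.random.normal(key_A, (20, 20))
v = jax.random.normal(key_v, (20,))
K = 5

Q, H, r, c = arnoldi(lambda x: A @ x, v, K)

e_K = jnp.zeros(K).at[K - 1].set(1.0)
relation_residual = A @ Q - Q @ H - jnp.outer(r, e_K)
orthogonality_residual = Q.T @ Q - jnp.eye(K)
```

Both residuals should vanish up to floating-point rounding for a well-conditioned matrix such as this one; for ill-conditioned matrices, orthogonality is typically lost without reorthogonalisation.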

Figures (12)

Figure 1: Values (down) and gradients (up) of functions of large matrices.
Figure 2: Lanczos/Arnoldi iteration.
Figure 2: Accuracy loss when differentiating the Arnoldi iteration on a Hilbert matrix in double precision ($\phi$: decompose with a full-rank Arnoldi iteration, then reconstruct the original matrix; measure $\|\partial \phi - I\|$; details in the appendix).
Figure 3: Backpropagation vs our adjoint method on a sparse matrix [kolodziej2019suitesparse, davis2011university, duff1989sparse].
Figure 4: All methods find the truth.
...and 7 more figures

Theorems & Definitions (7)

Theorem 4.1: Adjoint system of the Arnoldi iteration
proof : Sketch of the proof
Theorem 4.2: Adjoint system of the Lanczos iteration
proof : Sketch of the proof
Corollary 4.3: Parameter gradients
proof : Sketch of the proof
Remark E.5: $\Sigma$
