Position: Curvature Matrices Should Be Democratized via Linear Operators

Felix Dangel, Runa Eschenhagen, Weronika Ormaniec, Andres Fernandez, Lukas Tatzel, Agustinus Kristiadi

TL;DR

The work argues that curvature matrices central to neural-network training and analysis should be accessed via a unified linear-operator interface, enabling scalable, matrix-free computation and easy integration with existing tools. It introduces curvlinops, a PyTorch library that exposes Hessian, GGN, Fisher variants, and KFAC forms as linear operators, with safeguards, batch handling, and interoperability features. The approach demonstrates how operator abstractions simplify applications (e.g., second-order optimization, influence functions, model merging, pruning, and loss analysis), while enabling extensibility through connections to randomized linear algebra and SciPy ecosystems. This democratizes access to advanced curvature techniques, offering practical impact for large-scale models and diverse ML tasks, and sets the stage for future improvements like multi-GPU support and differentiable operators.

Abstract

Structured large matrices are prevalent in machine learning. A particularly important class is curvature matrices like the Hessian, which are central to understanding the loss landscape of neural nets (NNs), and enable second-order optimization, uncertainty quantification, model pruning, data attribution, and more. However, curvature computations can be challenging due to the complexity of automatic differentiation, and the variety and structural assumptions of curvature proxies, like sparsity and Kronecker factorization. In this position paper, we argue that linear operators -- an interface for performing matrix-vector products -- provide a general, scalable, and user-friendly abstraction to handle curvature matrices. To support this position, we developed $\textit{curvlinops}$, a library that provides curvature matrices through a unified linear operator interface. We demonstrate with $\textit{curvlinops}$ how this interface can hide complexity, simplify applications, be extensible and interoperable with other libraries, and scale to large NNs.
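To make the idea of "an interface for performing matrix-vector products" concrete, here is a minimal sketch of a matrix-free Hessian operator. It is not the curvlinops API: the names `MatrixFreeOperator`, `hessian_operator`, and `toy_grad` are hypothetical, and the Hessian-vector product is approximated with central finite differences of a hand-written gradient, whereas a real library would typically rely on automatic differentiation.

```python
from typing import Callable, List

Vector = List[float]


class MatrixFreeOperator:
    """Represents a D x D matrix only through its matrix-vector product."""

    def __init__(self, dim: int, matvec: Callable[[Vector], Vector]):
        self.shape = (dim, dim)
        self._matvec = matvec

    def __matmul__(self, v: Vector) -> Vector:
        if len(v) != self.shape[1]:
            raise ValueError(f"Expected length {self.shape[1]}, got {len(v)}.")
        return self._matvec(v)


def hessian_operator(
    grad: Callable[[Vector], Vector], x: Vector, eps: float = 1e-4
) -> MatrixFreeOperator:
    """Hessian at x, applied via H v ~ (g(x + eps v) - g(x - eps v)) / (2 eps)."""

    def matvec(v: Vector) -> Vector:
        plus = grad([xi + eps * vi for xi, vi in zip(x, v)])
        minus = grad([xi - eps * vi for xi, vi in zip(x, v)])
        return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

    return MatrixFreeOperator(len(x), matvec)


def toy_grad(x: Vector) -> Vector:
    """Gradient of the toy loss f(x) = x0^2 + 3 x0 x1 + 2 x1^2 (Hessian [[2, 3], [3, 4]])."""
    return [2 * x[0] + 3 * x[1], 3 * x[0] + 4 * x[1]]


H = hessian_operator(toy_grad, x=[0.5, -1.0])
Hv = H @ [1.0, 0.0]  # approximates the first Hessian column without forming H
```

The caller only ever sees `H @ v`, so the same code could be handed a GGN, Fisher, or Kronecker-factored operator instead, which is the interchangeability the position argues for.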

Paper Structure

This paper contains 40 sections, 16 equations, 3 figures.

Figures (3)

  • Figure 1: Visual tour of curvature matrices. White lines separate parameters into layers. We consider a synthetic classification task with a small convolutional neural net (three convolutional layers and one dense layer with ReLU and sigmoid activations, $D=683$).
  • Figure 2: Performance analysis: Run time (left column) and peak memory (right column) of linear operators benchmarked on ResNet50 on ImageNet (top row) and nanoGPT on Shakespeare (bottom row) on an A40 GPU with 40 GiB of RAM (the code used to generate these results is \linkToGithub/blob/75bc0a84b2001f052daeaeac9a58846f379fed8a/docs/examples/basic_usage/example_benchmark.py). Details: For ImageNet, we use a batch size of $64$ and images of shape $(3, 224, 224)$; for Shakespeare, we use a batch size of $4$ and context length $1024$. All linear operators use their default options and we do not use compilation. Models are in evaluation mode. KFAC neglects parameters in normalization layers (they are unsupported), and nanoGPT's last layer due to its large dimension ($\approx$50 K).
  • Figure 3: Estimating linear operator properties with curvlinops. We implement various estimation algorithms from the literature and evaluate them on toy problems. Top: Spectral density estimation with the algorithms and toy matrices from papyan2020prevalence. The left panel estimates a spectral density, the right panel the spectral density of the matrix logarithm $\log(|{\bm{A}}| + \epsilon {\bm{I}})$ with $\epsilon= 10^{-5}$. Code to reproduce these figures is \linkToDocs/en/latest/basic_usage/example_verification_spectral_density.html. Bottom: Comparison of trace and diagonal estimators (girard1989montecarlo; hutchinson1989stochastic; epperly2024xtrace; meyer2020hutch) for matrices whose spectrum follows a power law (${\bm{Q}}$ is obtained from the QR decomposition of a random Gaussian matrix). Solid lines are medians, error bars are the 25th and 75th percentiles over 200 runs. For traces, we use the relative error $|\hat{t} - t|/|t|$ where $\hat{t}$ estimates the true trace $t$. For diagonals, we report the relative error $\max_i |a_i - \hat{a}_i|/\max_j|a_j|$ where $\hat{{\bm{a}}}$ approximates the true diagonal ${\bm{a}}$. On matrices with fast spectral decay, estimation techniques based on variance reduction improve over vanilla Hutchinson estimators.
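
The vanilla Hutchinson trace estimator and the relative trace error from the Figure 3 caption can be sketched as follows. This is an illustrative, standard-library-only sketch, not the curvlinops implementation: the function names and the toy matrix (a power-law diagonal plus a rank-one term) are assumptions chosen for simplicity and differ from the matrices used in the paper.

```python
import random
from typing import Callable, List

Vector = List[float]


def hutchinson_trace(
    matvec: Callable[[Vector], Vector], dim: int, num_samples: int, seed: int = 0
) -> float:
    """Estimate tr(A) as the average of v^T A v over random Rademacher vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Av = matvec(v)  # only matrix-vector products are required
        total += sum(vi * avi for vi, avi in zip(v, Av))
    return total / num_samples


def relative_error(estimate: float, truth: float) -> float:
    """Relative trace error |t_hat - t| / |t|."""
    return abs(estimate - truth) / abs(truth)


# Hypothetical toy operator A = diag(d) + u u^T with power-law diagonal d_i = i^(-2).
DIM = 50
d = [float(i) ** -2 for i in range(1, DIM + 1)]
u = [DIM ** -0.5] * DIM


def toy_matvec(v: Vector) -> Vector:
    u_dot_v = sum(ui * vi for ui, vi in zip(u, v))
    return [di * vi + ui * u_dot_v for di, ui, vi in zip(d, u, v)]


true_trace = sum(d) + sum(ui * ui for ui in u)
estimate = hutchinson_trace(toy_matvec, DIM, num_samples=1000)
error = relative_error(estimate, true_trace)
```

The estimator's variance is driven by the off-diagonal entries of the matrix, which is why the variance-reduced methods compared in the figure can improve on it when the spectrum decays quickly.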