Efficient Sketches for Training Data Attribution and Studying the Loss Landscape

Andrea Schioppa

TL;DR

This work presents a novel framework for scalable gradient and HVP sketching, tailored for modern hardware, and sheds new light on the behavior of pre-trained language models, challenging assumptions about their intrinsic dimensionality and Hessian properties.

Abstract

The study of modern machine learning models often necessitates storing vast quantities of gradients or Hessian vector products (HVPs). Traditional sketching methods struggle to scale under these memory constraints. We present a novel framework for scalable gradient and HVP sketching, tailored for modern hardware. We provide theoretical guarantees and demonstrate the power of our methods in applications like training data attribution, Hessian spectrum analysis, and intrinsic dimension computation for pre-trained language models. Our work sheds new light on the behavior of pre-trained language models, challenging assumptions about their intrinsic dimensionality and Hessian properties.
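As a point of reference for what gradient sketching means, here is a minimal sketch of a plain dense Gaussian random projection. It is a generic baseline and not the framework proposed in the paper. The function names `make_dense_sketch` and `sketch`, the dimensions, and the synthetic "gradients" are all assumptions made for illustration. The example shows that the inner product of two high-dimensional vectors, the quantity gradient-based training data attribution typically relies on, is approximately preserved after projecting both vectors to a much smaller dimension.

```python
import numpy as np


def make_dense_sketch(dim, sketch_dim, seed=0):
    """Return a (sketch_dim x dim) Gaussian projection, scaled so that
    squared norms are preserved in expectation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((sketch_dim, dim)) / np.sqrt(sketch_dim)


def sketch(S, g):
    """Project a flattened gradient g to the low-dimensional sketch space."""
    return S @ g


# Synthetic stand-ins for flattened per-example gradients (assumption).
rng = np.random.default_rng(1)
dim, sketch_dim = 4096, 512
S = make_dense_sketch(dim, sketch_dim)
g_train = rng.standard_normal(dim)
g_test = g_train + 0.5 * rng.standard_normal(dim)

# Compare the exact inner product with the one computed from sketches.
exact = float(g_train @ g_test)
approx = float(sketch(S, g_train) @ sketch(S, g_test))
rel_err = abs(approx - exact) / abs(exact)
print(f"exact={exact:.1f} approx={approx:.1f} rel_err={rel_err:.3f}")
```

Note that the dense matrix `S` itself needs `sketch_dim * dim` entries, so this baseline becomes impractical at the parameter counts of modern models; that memory cost is the scaling problem the abstract refers to.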

Paper Structure

This paper contains 52 sections, 6 theorems, 54 equations, 7 figures, 11 tables.

Introduction
Related Work
Sketching
Intrinsic dimension
Scaling up influence functions
Hessian evolution during training
Design Principles for Efficient Sketching Algorithms
Dense sketches and FJL
FFD: Implicit Gradient sketching
Explicit sketches.
Removing the lookup bottleneck.
Alternative pre-conditioners.
Direct usage of the pre-conditioner $Q$.
Diagrams.
Theoretical results
...and 37 more sections

Key Result

Theorem 3.1

There exist inputs $x$ for which FFD does not satisfy the sketching property \eqref{eq:jl_prob}.

Figures (7)

Figure 1: Diagram to illustrate our proposed sketching algorithms.
Figure 2: Left: ratio (RNEG) of the absolute value of the top negative eigenvalue to the top positive eigenvalue. Right: ratio $R$ of the $n$-th largest positive eigenvalue to the largest positive eigenvalue. We call an eigenvalue an outlier when $R > 20\%$, motivated by behrooz-eigens. Higher-resolution versions for printing can be found in Appendix \ref{appx:add-experiments}. These results disprove conjectures on the Hessian structure; see Sec. \ref{subsec:eigen_evol}.
Figure 3: Peak memory usage of FJL compared with AFJL. Results on GPU (V100); FJL results for $D > 2^{20}$ are not reported because those runs hit Out-of-Memory errors.
Figure 4: Wall time of FJL compared with AFJL. Results on GPU (V100); FJL results for $D > 2^{20}$ are not reported because those runs hit Out-of-Memory errors.
Figure 5: ratio (RNEG) of the absolute value of the top negative to the top positive eigenvalue
...and 2 more figures
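
The two ratios defined in the Figure 2 caption can be written down directly. The following is a minimal sketch, assuming the Hessian eigenvalues are already available as an array and reading "top negative" as the negative eigenvalue of largest magnitude. The function names `rneg`, `ratio_r`, and `count_outliers`, and the toy eigenvalues, are hypothetical and not taken from the paper.

```python
import numpy as np


def rneg(eigs):
    """|top negative eigenvalue| / top positive eigenvalue.
    Assumes "top negative" means the most negative eigenvalue."""
    eigs = np.asarray(eigs, dtype=float)
    top_pos, top_neg = eigs.max(), eigs.min()
    if top_pos <= 0 or top_neg >= 0:
        raise ValueError("need at least one positive and one negative eigenvalue")
    return abs(top_neg) / top_pos


def ratio_r(eigs, n):
    """n-th largest positive eigenvalue / largest positive eigenvalue (n is 1-based)."""
    eigs = np.asarray(eigs, dtype=float)
    pos = np.sort(eigs[eigs > 0])[::-1]  # positive eigenvalues, descending
    return pos[n - 1] / pos[0]


def count_outliers(eigs, threshold=0.2):
    """Number of positive eigenvalues whose ratio R exceeds the threshold."""
    eigs = np.asarray(eigs, dtype=float)
    pos = eigs[eigs > 0]
    return int(np.sum(pos / pos.max() > threshold))


# Toy spectrum for illustration only (not data from the paper).
toy = [10.0, 4.0, 1.0, -0.5, -2.0]
print(rneg(toy), ratio_r(toy, 2), count_outliers(toy))
```

Under this reading the largest eigenvalue always counts as an outlier, since its ratio is 1.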

Theorems & Definitions (9)

Theorem 3.1
Theorem 3.2
Theorem 3.3
Theorem C.1
proof
Theorem C.2
proof
Theorem C.3
proof
