Efficient Sketches for Training Data Attribution and Studying the Loss Landscape

Andrea Schioppa

TL;DR

This work presents a novel framework for scalable gradient and HVP sketching, tailored for modern hardware, and sheds new light on the behavior of pre-trained language models, challenging assumptions about their intrinsic dimensionality and Hessian properties.

Abstract

The study of modern machine learning models often necessitates storing vast quantities of gradients or Hessian vector products (HVPs). Traditional sketching methods struggle to scale under these memory constraints. We present a novel framework for scalable gradient and HVP sketching, tailored for modern hardware. We provide theoretical guarantees and demonstrate the power of our methods in applications like training data attribution, Hessian spectrum analysis, and intrinsic dimension computation for pre-trained language models. Our work sheds new light on the behavior of pre-trained language models, challenging assumptions about their intrinsic dimensionality and Hessian properties.

Paper Structure

This paper contains 52 sections, 6 theorems, 54 equations, 7 figures, 11 tables.

Key Result

Theorem 3.1

There exist inputs $x$ for which FFD does not satisfy the sketching property \eqref{eq:jl_prob}.

Figures (7)

  • Figure 1: Diagram to illustrate our proposed sketching algorithms.
  • Figure 2: Left: ratio (RNEG) of the absolute value of the top negative eigenvalue to the top positive eigenvalue. Right: ratio $R$ of the $n$-th largest positive eigenvalue to the largest positive eigenvalue. We define outliers as eigenvalues with $R>20\%$, motivated by \cite{behrooz-eigens}. Higher-resolution versions for printing can be found in Appendix \ref{appx:add-experiments}. These results disprove conjectures on the Hessian structure; see Sec. \ref{subsec:eigen_evol}.
  • Figure 3: Peak memory usage of FJL compared with AFJL. Results on GPU (V100); FJL results for $D>2^{20}$ are not reported because they ran into out-of-memory errors.
  • Figure 4: Wall time of FJL compared with AFJL. Results on GPU (V100); FJL results for $D>2^{20}$ are not reported because they ran into out-of-memory errors.
  • Figure 5: ratio (RNEG) of the absolute value of the top negative to the top positive eigenvalue
  • ...and 2 more figures
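
The two ratios defined in the Figure 2 caption can be computed from a list of Hessian eigenvalue estimates. The following is a minimal sketch, not the paper's code: the function name `eigen_ratios`, its arguments, and the example eigenvalues are hypothetical, and it assumes "top negative eigenvalue" means the negative eigenvalue of largest magnitude.

```python
import numpy as np


def eigen_ratios(eigenvalues, n, outlier_threshold=0.20):
    """Return (RNEG, R, is_outlier) for the n-th largest positive eigenvalue.

    Hypothetical helper illustrating the Figure 2 definitions:
      RNEG = |top negative eigenvalue| / top positive eigenvalue
      R    = n-th largest positive eigenvalue / largest positive eigenvalue
    Assumption: "top negative" is the negative eigenvalue of largest magnitude.
    """
    eig = np.asarray(eigenvalues, dtype=float)
    positive = np.sort(eig[eig > 0])[::-1]  # positive eigenvalues, descending
    negative = eig[eig < 0]

    if positive.size == 0:
        raise ValueError("at least one positive eigenvalue is required")
    if not 1 <= n <= positive.size:
        raise ValueError("n must index an existing positive eigenvalue")

    top_positive = positive[0]
    # If there are no negative eigenvalues, report a ratio of zero.
    rneg = abs(negative.min()) / top_positive if negative.size else 0.0
    r = positive[n - 1] / top_positive
    # The caption calls an eigenvalue an outlier when R exceeds 20%.
    return rneg, r, bool(r > outlier_threshold)


# Made-up eigenvalues, for illustration only.
example_eigenvalues = [10.0, 3.0, 1.0, -0.5, -2.0]
print(eigen_ratios(example_eigenvalues, n=2))
```

In practice the eigenvalues of a large model's Hessian are not available exactly and would have to be estimated; the helper only shows how the reported ratios follow from such estimates.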

Theorems & Definitions (9)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem C.1
  • proof
  • Theorem C.2
  • proof
  • Theorem C.3
  • proof