Revisiting inverse Hessian vector products for calculating influence functions

Yegor Klochkov, Yang Liu

TL;DR

It is shown that the three hyperparameters -- the scaling factor, the batch size, and the number of steps -- can be chosen depending on the spectral properties of the Hessian, particularly its trace and largest eigenvalue.

Abstract

Influence functions are a popular tool for attributing a model's output to training data. The traditional approach relies on the calculation of inverse Hessian-vector products (iHVP), but the classical solver "Linear time Stochastic Second-order Algorithm" (LiSSA, Agarwal et al. (2017)) is often deemed impractical for large models due to expensive computation and hyperparameter tuning. We show that the three hyperparameters -- the scaling factor, the batch size, and the number of steps -- can be chosen depending on the spectral properties of the Hessian, particularly its trace and largest eigenvalue. By evaluating with random sketching (Swartworth and Woodruff, 2023), we find that the batch size has to be sufficiently large for LiSSA to converge; however, for all of the models we consider, the requirement is mild. We confirm our findings empirically by comparing to Proximal Bregman Retraining Functions (PBRF, Bae et al. (2022)). Finally, we discuss what role the inverse Hessian plays in calculating the influence.
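The two spectral quantities named in the abstract, the trace and the largest eigenvalue of the Hessian, can both be estimated from Hessian-vector products alone. The following is a minimal sketch that uses power iteration and Hutchinson's trace estimator. These are standard stand-ins and are not the random-sketching procedure the paper evaluates with. The function names, the toy matrix `H`, and the iteration counts are illustrative assumptions.

```python
import numpy as np

def power_iteration_lmax(hvp, dim, n_iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD operator from HVPs."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    # Rayleigh quotient of the final unit vector
    return float(v @ hvp(v))

def hutchinson_trace(hvp, dim, n_probes=200, seed=0):
    """Estimate the trace as the average of z^T H z over Rademacher probes z."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ hvp(z)
    return float(total / n_probes)

# Toy stand-in for a Hessian (assumption): a small PSD matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
H = A.T @ A / 50

def hvp(v):
    return H @ v

lmax_est = power_iteration_lmax(hvp, 20)
trace_est = hutchinson_trace(hvp, 20)
```

For a real model, `hvp` would be replaced by an automatic-differentiation Hessian-vector product, so the Hessian never has to be materialized.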

Paper Structure (23 sections, 3 theorems, 78 equations, 7 figures, 3 tables)

This paper contains 23 sections, 3 theorems, 78 equations, 7 figures, 3 tables.

Key Result

Theorem 1

Suppose $\eta < 1 / (\lambda_{\max}(H) + \lambda)$. Then we have convergence in expectation [...]. Furthermore, assume that $\eta > 0$, $\delta \in (0, 1)$ are such that [...]. Then [...], where we interpret $\tilde{\Delta} = \mathbf{E} \| (H - \tilde{H}_t) \mathbf{u}^{*}\|^{2}$ as a sampling error.
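To make the role of the three hyperparameters concrete, here is a minimal sketch of a damped LiSSA-style recursion, $\mathbf{u}_t = \mathbf{u}_{t-1} + \eta\,(\mathbf{g} - (\tilde{H}_t + \lambda)\mathbf{u}_{t-1})$, on a toy quadratic problem. This is one common way to write the iteration and may differ in detail from the paper's exact formulation. The helper names (`lissa_ihvp`, `sample_hvp`), the toy data, and the specific values of $\eta$, $\lambda$, and the batch size are assumptions chosen for illustration.

```python
import numpy as np

def lissa_ihvp(sample_hvp, g, n_samples, eta, lam, batch_size, n_steps, seed=0):
    """Approximate (H + lam * I)^{-1} g using mini-batch Hessian-vector products."""
    rng = np.random.default_rng(seed)
    u = np.zeros_like(g)
    for _ in range(n_steps):
        # Draw a fresh mini-batch; sample_hvp returns the batch-averaged HVP.
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        u = u + eta * (g - (sample_hvp(idx, u) + lam * u))
    return u

# Toy problem (assumption): per-sample Hessians are a_i a_i^T.
rng = np.random.default_rng(0)
n, dim, lam = 512, 10, 0.1
A = rng.standard_normal((n, dim))
H = A.T @ A / n
g = rng.standard_normal(dim)

def sample_hvp(idx, v):
    Ab = A[idx]
    return Ab.T @ (Ab @ v) / len(idx)

# Step size chosen well inside the condition eta < 1 / (lambda_max(H) + lambda).
lmax = np.linalg.eigvalsh(H)[-1]
eta = 0.1 / (lmax + lam)

u_star = np.linalg.solve(H + lam * np.eye(dim), g)  # exact damped iHVP
u_full = lissa_ihvp(sample_hvp, g, n, eta, lam, batch_size=n, n_steps=1000)
u_mini = lissa_ihvp(sample_hvp, g, n, eta, lam, batch_size=128, n_steps=1000)

def rel_err(u):
    return float(np.linalg.norm(u - u_star) / np.linalg.norm(u_star))
```

With the full batch the iterate converges to the exact solution, while with mini-batches it fluctuates around it. The size of that fluctuation is what the sampling-error term $\tilde{\Delta}$ in the theorem accounts for.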

Figures (7)

  • Figure 1: Comparison of PBRF and LiSSA influence. The first row shows examples of training images. Below, the $x$-axis represents LiSSA influences, and the $y$-axis represents the PBRF influences corresponding to each training image and 500 test images. The second row is for ResNet-18, and the third row is for ResNet-50.
  • Figure 2: Convergence of LiSSA for ResNet-18 with different batch size configurations. We calculate the correlation between test influences at steps 1..1000 of LiSSA. The result for the small batch size of 10 is averaged over 10 trials, so that the amount of data used in the middle and rightmost figures is the same.
  • Figure 3: Similarity between 20 sentences; see the complete list in the Appendix, Section \ref{similarity_prompts}. The left panel shows influence similarity calculated with LiSSA, the middle panel shows gradient similarity, and the right panel shows the difference between the former and the latter. In the rightmost panel the numbers show the mean over each 10x10 square, with the standard error in brackets. We use the OPT-1.3B model, with $\lambda = 5.0$, $T = 1000$ and $\eta = 0.003$. We also use a batch size of $4$ sequences, each consisting of $512$ tokens.
  • Figure 4: Comparing traces of the LHS and RHS of condition \ref{eq:condition1} for different batch sizes. We evaluate the traces for ResNet-18, ResNet-50, and OPT-1.3B, with 4 batch sizes for each model. For OPT-1.3B, the batch size is counted in tokens.
  • Figure 5: Comparison of PBRF and LiSSA influence on ResNet-18 for 25 random train images. Each graph shows the influence of one train image w.r.t. 500 other test images. The reference number is shown above the image; refer to Figure \ref{fig:retrain_examples}. The results are for ResNet-18, the $x$-axis is LiSSA, and the $y$-axis is PBRF.
  • ...and 2 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Corollary 1
  • Remark 1
  • Lemma 1
  • proof