
Empirical influence functions to understand the logic of fine-tuning

Jordan K. Matelsky, Lyle Ungar, Konrad P. Kording

TL;DR

The paper introduces empirical influence functions (EIFs) to quantify how fine-tuning data affect model outputs and defines desiderata for useful influences. Using EIFs on CNNs trained on Fashion-MNIST and fine-tuned on MNIST, and on a Phi-3 LLM, the authors connect their observations to the neural tangent kernel (NTK) regime, showing symmetric influences in CNNs but persistent asymmetry and a lack of robust logical structure in LLM fine-tuning. Prompting with in-context data partly rescues the desiderata, illustrating that context can outperform pure fine-tuning for enabling logical and causal inferences. The work provides an efficient, scalable EIF framework to diagnose and potentially steer learning from fine-tuning stimuli, with implications for model alignment, interpretability, and meta-learning approaches that aim to shape the influence of training data.
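The link between CNN symmetry and the NTK regime can be motivated with a first-order expansion. This sketch assumes that the EIF of training sample $i$ on probe $j$ is the change in $\log p_\theta(y_j \mid x_j)$ after a single gradient step of size $\eta$ on $\ell_i(\theta) = -\log p_\theta(y_i \mid x_i)$; that definition is an illustrative assumption, not necessarily the paper's exact formulation.

```latex
\begin{aligned}
\theta' &= \theta - \eta\,\nabla_\theta \ell_i(\theta)
         = \theta + \eta\,\nabla_\theta \log p_\theta(y_i \mid x_i) \\
E_{ij} &= \log p_{\theta'}(y_j \mid x_j) - \log p_\theta(y_j \mid x_j) \\
       &= \eta\,\nabla_\theta \log p_\theta(y_j \mid x_j)^{\top}\,
          \nabla_\theta \log p_\theta(y_i \mid x_i) + O(\eta^2) \\
       &= \eta\, g_j^{\top} J_j J_i^{\top} g_i + O(\eta^2)
        = E_{ji} + O(\eta^2)
\end{aligned}
```

Here $J_i$ is the Jacobian of the network outputs with respect to $\theta$ at $x_i$, $g_i$ is the gradient of $\log p_\theta(y_i \mid x_i)$ with respect to those outputs, and $J_j J_i^{\top}$ is the empirical neural tangent kernel evaluated at $(x_j, x_i)$. The leading term is a scalar inner product, so it is symmetric in $i$ and $j$; asymmetry enters only through higher-order terms.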

Abstract

Understanding the process of learning in neural networks is crucial for improving their performance and interpreting their behavior. This can be approximately understood by asking how a model's output is influenced when we fine-tune it on a new training sample. There are desiderata for such influences, such as decreasing influence with semantic distance, sparseness, noise invariance, transitive causality, and logical consistency. Here we use the empirical influence, measured via fine-tuning, to demonstrate how individual training samples affect outputs. We show that these desiderata are violated both for simple convolutional networks and for a modern LLM. We also illustrate how prompting can partially rescue this failure. Our paper presents an efficient and practical way of quantifying how well neural networks learn from fine-tuning stimuli. Our results suggest that popular models cannot generalize or perform logic in the way they appear to.
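The abstract defines influence operationally: fine-tune on one sample, then measure how the model's output on another sample changes. A minimal sketch of that loop follows, assuming a linear softmax classifier in NumPy as a stand-in for the paper's CNN and LLM, plain SGD on the negative log-likelihood, and influence defined as the change in log-probability of a probe's label. The helper names (`empirical_influence`, `eif_matrix`) and the hyperparameters are hypothetical.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def log_prob(W, x, y):
    # log p(y | x) for a linear softmax classifier with logits W @ x.
    return log_softmax(W @ x)[y]

def grad_nll(W, x, y):
    # Gradient of -log p(y | x) w.r.t. W: (softmax(W @ x) - onehot(y)) outer x.
    p = np.exp(log_softmax(W @ x))
    p[y] -= 1.0
    return np.outer(p, x)

def empirical_influence(W, train, probe, lr=0.1, steps=1):
    """Change in log p(probe label) after fine-tuning a copy of W on one sample.

    Assumed definition for illustration; the original weights W are untouched.
    """
    x_t, y_t = train
    x_p, y_p = probe
    W_ft = W.copy()
    for _ in range(steps):
        W_ft -= lr * grad_nll(W_ft, x_t, y_t)  # plain SGD step on the train sample
    return log_prob(W_ft, x_p, y_p) - log_prob(W, x_p, y_p)

def eif_matrix(W, samples, **kw):
    # Row i: fine-tuning sample; column j: probe sample.
    n = len(samples)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = empirical_influence(W, samples[i], samples[j], **kw)
    return M

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((3, 8))  # 3 classes, 8 input features (toy sizes)
samples = [(rng.standard_normal(8), int(rng.integers(3))) for _ in range(5)]
M = eif_matrix(W, samples, lr=0.1)
print(M.shape)  # (5, 5)
```

For this linear model, one step on a sample always raises that sample's own label logit and lowers the others, so the diagonal of `M` is positive. To first order in `lr`, `M[i, j]` is `lr` times the inner product of the two samples' log-probability gradients, so a very small `lr` yields a nearly symmetric matrix; larger or repeated steps can break that.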

Paper Structure (22 sections, 6 equations, 6 figures)

Figures (6)

  • Figure 1: Results of fine-tuning a CNN model on out-of-domain data. A. Train on Fashion-MNIST. We first train a CNN model to correctly classify fashion items like shoes and purses. B. Fine-tune on a single MNIST sample. We then fine-tune that model on a single example from the MNIST digit dataset. Here, we train the model to "correctly" classify the digit 7 as a shoe. C. Introduce noise. We add different levels of noise to simulate degraded data quality. D. The pairwise EIF matrix. The resulting influence function is largely symmetric, and its blockwise pattern shows that digits are most informative about other digits of the same class.
  • Figure 2: The transitivity training example. A. Desired behavior for the first sample of training text in the matrix. Here, we fine-tune a model on two training samples, "a {A} is a {B}" and "all {B}s have a {C}". The vertical labels give, per column, the desideratum being tested. The obvious implication is that a {A} should have a {C}; in other words, the $\Delta$ conditional probability should increase. Paradoxically, the conditional probability of producing the tokens "a {A} has a {C}" is barely affected, and neither is that of "a {A} does not have {C}". B. The pairwise matrix of influence from training samples (vertical) to inference samples (horizontal) for learned EIF (left) and prompted EIF (right). Horizontal "banding" patterns indicate that a training example uniformly influences other samples. Vertical banding indicates that an output tends to be made more likely when fine-tuning on any of the training samples.
  • Figure 3: Causal and logical induction in the chain_induces training domain. We would expect a learning machine to be able to describe the direction of the causal arrow, but training on a chain of causes ($A\rightarrow B\rightarrow C\rightarrow D\rightarrow Z$, or even short subsections of this chain) makes it no more likely to produce the correct arrow, $A\rightarrow Z$, than the incorrect arrow, $Z\rightarrow A$. Providing this information in the prompt rather than in training samples (right pane) rescues the model's performance.
  • Figure 4: Ontological reasoning in the belongs_to training domain. Set containment is an asymmetric relation: all squares are rectangles, but not all rectangles are squares. Yet in the absence of prior knowledge of a domain, learning $A\in B$ does not endow an LLM with the ability to distinguish between $B \ni A$ (correct) and $B \in A$ (incorrect). An out-of-domain sample ("Frogs and toads often hibernate in winter.") is included to illustrate that orthogonal examples are also affected (the EIF values in the last row and column are slightly negative). On the right, we illustrate that including the information in the prompt rather than in training samples partially rescues the model's performance on ontological reasoning, but token order remains unreasonably important.
  • Figure 5: A. Histogram of empirical influence function values, showing how diffuse influence is. The distribution of EIF values for learned knowledge is much more diffuse and positive-valued than that of prompted, in-context processing. We interpret this to suggest that models pull in both relevant and irrelevant information when formulating a response, rather than retrieving specific, relevant information, as we would hope. Note the negative tail of the distribution, unique to the prompting condition. The histogram is pooled over all synthetic knowledge domains; individual histograms per domain are reported in the Supplemental Material. (AU = arbitrary units.) B. Symmetry measure broken down per domain. For all knowledge domains, prompting leads to a more asymmetric EIF than fine-tuning.
  • ...and 1 more figure
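Figures 1 and 5 summarize EIF matrices by how symmetric they are and by how their off-diagonal values are distributed. The paper's exact symmetry measure is not stated in this summary, so the sketch below assumes one: the share of the matrix's squared Frobenius norm carried by its symmetric part (1.0 for a symmetric matrix, 0.0 for an antisymmetric one). `symmetry_score` and `off_diagonal` are hypothetical helpers, not the authors' code.

```python
import numpy as np

def symmetry_score(M):
    """Share of ||M||_F^2 held by the symmetric part of M.

    Hypothetical measure (an assumption, not necessarily the paper's):
    1.0 for a symmetric matrix, 0.0 for an antisymmetric one.
    """
    M = np.asarray(M, dtype=float)
    S = 0.5 * (M + M.T)  # symmetric part
    A = 0.5 * (M - M.T)  # antisymmetric part
    s2, a2 = np.sum(S ** 2), np.sum(A ** 2)
    return s2 / (s2 + a2)  # s2 + a2 == ||M||_F^2; nan if M is all zeros

def off_diagonal(M):
    """EIF values for i != j (influence on other samples), in row-major order."""
    M = np.asarray(M, dtype=float)
    return M[~np.eye(M.shape[0], dtype=bool)]

sym = np.array([[1.0, 0.5], [0.5, 1.0]])
skew = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(symmetry_score(sym))              # 1.0
print(symmetry_score(skew))             # 0.0
print(np.mean(off_diagonal(skew) < 0))  # 0.5 (fraction of negative influences)
```

Because the symmetric and antisymmetric parts are orthogonal under the Frobenius inner product, the denominator equals the squared norm of the matrix itself. For a distribution plot in the spirit of Figure 5A, passing `off_diagonal(M)` to `np.histogram` is one option.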