Unforgettable Generalization in Language Models

Eric Zhang, Leshem Choshen, Jacob Andreas

TL;DR

This work studies the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels, and shows that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting.

Abstract

When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
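As a rough illustration of what "fine-tuning on randomized labels" involves on the data side, the sketch below builds a forgetting set by replacing each gold label with a uniformly sampled one. The function name `randomize_labels`, the `num_choices` parameter, and the choice to sample independently of the true label are assumptions made for illustration; the abstract does not specify the exact sampling procedure.

```python
import random
from typing import List, Sequence


def randomize_labels(labels: Sequence[int], num_choices: int = 2, seed: int = 0) -> List[int]:
    """Return one uniformly random label per example (hypothetical helper).

    Each new label is drawn independently of the gold label, so on a
    two-choice task roughly half of the new labels still match the gold ones.
    """
    rng = random.Random(seed)  # seeded so the forgetting set is reproducible
    return [rng.randrange(num_choices) for _ in labels]


if __name__ == "__main__":
    gold = [0, 1, 1, 0, 1]
    print(randomize_labels(gold))
```

A model fine-tuned on such labels cannot do better than chance on the forgetting set in expectation, which is why near-random predictions on those examples are the expected outcome.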

Paper Structure (22 sections, 3 equations, 9 figures)

Figures (9)

  • Figure 1: Stylized learning and forgetting curves. Our experiments first fine-tune a pre-trained LM, then train it further on random labels. We call the gap between the forget accuracy and the random chance accuracy (50%) the forget gap. In many tasks we find a nonzero forget gap: after training on random labels, LMs do not generalizably learn to produce random outputs on new task instances.
  • Figure 2: Single-task forgetting. Top: The blue arrow visualizes the change in held-out accuracy after fine-tuning and the red arrow illustrates the change in accuracy after forgetting. We find that many tasks do not return to the expected accuracy of 50% after forgetting. Bottom left: The forget gap (the difference between the forget accuracy and the expected random accuracy of 1/2) across tasks. Smaller values correspond to a greater degree of forgetting. Bottom right: The forget ratio (the difference between the fine-tuned accuracy and the forget accuracy, divided by the difference between the fine-tuned accuracy and the expected random accuracy of 1/2). Larger forget ratios correspond to more successful forgetting.
  • Figure 3: Cross-task forgetting (higher values indicate more successful forgetting). We fine-tune the model on random labels from one task and then evaluate the model on another task. The vertical axis displays the task the model was trained to forget and the horizontal axis displays the task the model was evaluated on. Surprisingly, certain capabilities are robust to forgetting even after fine-tuning on random labels. Moreover, the effectiveness of the forgetting procedure is largely determined by the tasks that the model is evaluated on, not the tasks that the model was trained to forget. Note that rows and columns are presented in different orders, and clustered using the UPGMA algorithm.
  • Figure 4: Predictors of the forget ratio (y-axis). Each point is a different task. Top: The accuracy on the task after fine-tuning. The effectiveness of the forgetting procedure is not determined by the difficulty of the task (as measured by accuracy). Middle: The variance, across examples, of the hidden state of the last token of the question in the fifth-to-last layer. This variance is somewhat predictive of the amount forgotten, indicating that "broader" tasks are more difficult to forget. Bottom: The model's confidence in the correct response. The probability of the correct response relative to the distractor is predictive of forgetting, indicating that models forget more of the examples they were already not confident about.
  • Figure 5: Forgetting order vs. learning order. The horizontal axis shows the forgetting time: the number of epochs until the model forgets a data point (assigns < 60% confidence to the correct response). The vertical axis shows the learning time: the number of epochs until the model learns a data point (assigns > 60% confidence to the correct label). We filter out the examples that are never learned or never forgotten. If fewer than 100 examples fulfil the criteria, we do not plot the task. Overall, we find that learning and forgetting orders are weakly, but consistently, anticorrelated.
  • ...and 4 more figures
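
The quantities defined in the captions above (forget gap, forget ratio, and per-example learning and forgetting times) can be sketched in a few lines. This is a minimal illustration following the caption definitions, not the authors' code: the function names, the 1-indexed epoch convention, and the error raised when fine-tuned accuracy does not exceed chance are all assumptions.

```python
from typing import Optional, Sequence

CHANCE = 0.5  # expected accuracy of random guessing on a two-choice task


def forget_gap(forget_acc: float, chance: float = CHANCE) -> float:
    """Held-out accuracy after forgetting minus chance; smaller means more forgetting."""
    return forget_acc - chance


def forget_ratio(finetuned_acc: float, forget_acc: float, chance: float = CHANCE) -> float:
    """Share of the fine-tuning gain over chance that forgetting removed."""
    gain = finetuned_acc - chance
    if gain <= 0:
        # Assumption: the ratio is only meaningful when fine-tuning beat chance.
        raise ValueError("fine-tuned accuracy must exceed chance accuracy")
    return (finetuned_acc - forget_acc) / gain


def first_epoch(confidences: Sequence[float], threshold: float = 0.6,
                above: bool = True) -> Optional[int]:
    """1-indexed epoch at which the confidence in the correct label first
    rises above (learning) or falls below (forgetting) the threshold.

    Returns None for examples that never cross it; the Figure 5 caption
    says such examples are filtered out.
    """
    for epoch, p in enumerate(confidences, start=1):
        crossed = p > threshold if above else p < threshold
        if crossed:
            return epoch
    return None


if __name__ == "__main__":
    print(forget_gap(0.7), forget_ratio(0.9, 0.7))
    print(first_epoch([0.5, 0.55, 0.7]), first_epoch([0.9, 0.8, 0.55], above=False))
```

Under these definitions, a forget ratio of 1 means held-out accuracy returned to chance, and a ratio of 0 means forgetting left held-out accuracy unchanged.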