
Learning and Unlearning of Fabricated Knowledge in Language Models

Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, Mark Sandler

TL;DR

How long does a newly introduced piece of knowledge last while a language model continues to train? This question is investigated by injecting facts into LMs from a new probing dataset, "Outlandish", designed to permit the testing of a spectrum of different fact types. Facts that conflict with common knowledge are remembered for tens of thousands of training steps, while facts that do not conflict with common knowledge are forgotten much more rapidly.

Abstract

What happens when a new piece of knowledge is introduced into the training data, and how long does it last while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty, between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically, we show that facts that conflict with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled), are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime" how the language model hallucinates on logically unrelated prompts, showing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that the impacts of knowledge-conflicting facts in LMs, though they can be long-lasting, can be largely erased by a novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.

Paper Structure

This paper contains 20 sections, 2 equations, 8 figures.

Figures (8)

  • Figure 1: Depiction of results from Fig. 3 on a Wundt-like curve.
  • Figure 2: Bold red line along X-axis on plots denotes the period of false fact inception. FT and KCF in the plot legends denote, respectively, next-token-prediction accuracy (%) on the finetuning validation set and on the inserted knowledge-conflicting fact. (a) Longevity of CounterFact memories in the LM while undergoing finetuning. (b) Example of a CounterFact fact. (c, e) Longevity of knowledge-conflicting facts, where 200 varied phrasings are presented either (c) solely at the beginning of the finetuning period and then never again, or (e) at regular intervals over the course of 5000 iterations of finetuning and then never again. See the analysis section for plot details. (d) Example of two syntactically varied phrasings of a single false fact with the same keywords and semantic meaning.
  • Figure 3: Bold red line along X-axis on plots denotes the period of false fact inception. Plotted is KCF longevity as a function of (a) the number of exclusive presentations of KCF at the onset of finetuning and (b) the density of KCF presentations at regular intervals during finetuning. Notice the step-like nature of the perplexity plot (left) at the density of 1 KCF per 1200 iterations. A step occurs each time a single knowledge-conflicting fact is presented, and the memory of this single presentation carries over more than a thousand iterations to the next single occurrence.
  • Figure 4: Red line on plots denotes the period of false fact inception. FT and KCF denote, respectively, the next-token-prediction accuracy (%) on the finetuning validation set and on the inserted knowledge-conflicting fact. (a) Examples of mundane and randomized facts corresponding to the example KCF given in Fig. 2d. Note that all three share the same keywords. (b-c) Longevity of KCFs vs. mundane or randomized versions after injection into PaLM-8B while the model undergoes finetuning. See the analysis section for analysis details. (d-e) Insertion of a KCF into the language model "primes" how the model hallucinates on other, logically unrelated prompts. (d) compares the priming effect after inserting a KCF vs. a mundane or randomly jumbled fact, applied to the 3 different prefixes displayed in (e).
  • Figure 5: FT denotes finetuning. (a) Impact of masking different percentages of KCF parameter updates (bottom vs. top k%). (b) Example trace showing the effect of our sparsification procedure on the KCF memory. (c) Summary data showing the effect of our sparsification procedure on the memory of the finetuning task versus the memory of the KCF. At 85% sparsification (green line), the KCF has been nearly entirely erased while finetuning is largely unaffected. This was robust over a 32-fold range of KCF presentation density during such finetuning.
  • ...and 3 more figures
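
The Figure 5 caption describes masking a percentage of parameter updates (bottom vs. top k%) and applying the sparsified updates over several steps. The following is a minimal sketch of one possible reading of that idea, not the authors' implementation: the function names `sparsify_update` and `apply_sparse_steps`, the ranking of entries by absolute magnitude, the flat-list parameter representation, and the `drop` argument are all assumptions made for illustration. The source does not specify which entries are masked in the 85% setting, so both options are left selectable.

```python
def sparsify_update(update, sparsity, drop="smallest"):
    """Zero out a fraction `sparsity` of the entries of a flat update vector.

    Entries are ranked by absolute magnitude (an assumption); `drop` selects
    whether the smallest or the largest entries are masked.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    if drop not in ("smallest", "largest"):
        raise ValueError("drop must be 'smallest' or 'largest'")
    n_drop = int(round(sparsity * len(update)))
    # Indices ordered so that the entries to be masked come first.
    order = sorted(
        range(len(update)),
        key=lambda i: abs(update[i]),
        reverse=(drop == "largest"),
    )
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else u for i, u in enumerate(update)]


def apply_sparse_steps(params, updates, sparsity, drop="smallest"):
    """Apply a sequence of updates, sparsifying each one before adding it."""
    for update in updates:
        masked = sparsify_update(update, sparsity, drop)
        params = [p + m for p, m in zip(params, masked)]
    return params


if __name__ == "__main__":
    # Hypothetical toy update; real updates would come from an optimizer.
    toy_update = [0.1, -0.5, 0.3, -0.2]
    print(sparsify_update(toy_update, 0.5, drop="smallest"))
    print(sparsify_update(toy_update, 0.5, drop="largest"))
    print(apply_sparse_steps([0.0] * 4, [toy_update, toy_update], 0.5))
```

In a real training loop the update would be a per-tensor optimizer step and not a Python list, and the mask would likely be computed per tensor or globally across tensors; the sketch leaves that choice open.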