Mitigating Memorization In Language Models

Mansi Sakarvadia, Aswathy Ajith, Arham Khan, Nathaniel Hudson, Caleb Geniesse, Kyle Chard, Yaoqing Yang, Ian Foster, Michael W. Mahoney

TL;DR

The proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.

Abstract

Language models (LMs) can "memorize" information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three fine-tuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.
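
To make the notion of "verbatim regurgitation" concrete, the following is a minimal sketch of how one might test for it. It is not the paper's Definition 2.1 (which is not reproduced here). It assumes a sequence counts as memorized when greedy decoding from its first `prompt_len` tokens reproduces the remaining tokens exactly. The names `toy_next_token`, `SECRET`, `is_memorized`, and `percent_memorized` are hypothetical, and the toy "model" is a stand-in for a real LM.

```python
from typing import Callable, List, Sequence

# Hypothetical training sequence that the toy model has "memorized".
SECRET: List[int] = [7, 3, 9, 4, 1, 8]


def toy_next_token(tokens: Sequence[int]) -> int:
    """Stand-in for an LM: regurgitates SECRET when the context is a prefix of it,
    otherwise continues by adding 2 to the last token."""
    n = len(tokens)
    if n < len(SECRET) and list(tokens) == SECRET[:n]:
        return SECRET[n]
    return tokens[-1] + 2


def greedy_continue(next_token: Callable[[Sequence[int]], int],
                    prompt: Sequence[int], n_tokens: int) -> List[int]:
    """Generate n_tokens tokens one at a time, feeding each output back as context."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def is_memorized(next_token: Callable[[Sequence[int]], int],
                 sequence: Sequence[int], prompt_len: int) -> bool:
    """Assumed criterion: the model reproduces the suffix verbatim from the prefix."""
    prompt, target = sequence[:prompt_len], sequence[prompt_len:]
    return greedy_continue(next_token, prompt, len(target)) == list(target)


def percent_memorized(next_token: Callable[[Sequence[int]], int],
                      sequences: Sequence[Sequence[int]], prompt_len: int) -> float:
    """Share of sequences (in percent) that pass the assumed memorization check."""
    if not sequences:
        return 0.0
    hits = sum(is_memorized(next_token, s, prompt_len) for s in sequences)
    return 100.0 * hits / len(sequences)
```

Under these assumptions, `is_memorized(toy_next_token, SECRET, 3)` holds because the toy model completes `[7, 3, 9]` with `[4, 1, 8]`, while a sequence the model never stored, such as `[1, 2, 3, 4, 5, 6]`, does not pass the check.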

Paper Structure

This paper contains 45 sections, 5 equations, 22 figures, 9 tables, 6 algorithms.

Figures (22)

  • Figure 1: Memorization Mitigation Strategies. Overview of the methods that we compare and contrast in this work. Green methods are new strategies that we propose.
  • Figure 2: Machine unlearning strategies localize information within the neurons/weights of an LM, which can subsequently be dropped to facilitate "unlearning" of information. Box plot compares the best neuron-based vs. weight-based unlearning methods for Pythia models (see Appendix \ref{appendix:unlearning_hp_search_pythia}). A perfect unlearning technique would have a -100% difference in memorization and a 0% difference in perplexity. Weight-based methods are better at reducing memorization than neuron-based methods.
  • Figure 3: Unlearning strategies comparison. (Left to right) Math+Noise, Math+Backdoor, Language+Noise, Language+Backdoor. Unlearning strategies are compared across varying model sizes, unlearning times, and data sizes. Effective unlearning techniques result in a 0% difference in accuracy for math models or a 0% difference in perplexity for language models, and a -100% difference in % memorized. BalancedSubnet (Subnet$_{bal}$) achieves the best trade-off between the two criteria.
  • Figure 4: Loss landscapes for the Pythia 2.8B model. (a) Original model's landscape. (b) Well-edited model's landscape using BalancedSubnet with well-configured HPs. (c) Badly edited model's landscape using Subnet with poorly configured HPs. While the good edit does not appear to change the landscape much, the bad edit drastically changes the loss landscape. Details regarding the creation of this visualization are found in Appendix \ref{appendix:loss_landscape}.
  • Figure 5: Unlearning strategies comparison. Comparison of memorization percent difference (closer to -100 is better) versus perplexity/accuracy percent difference (closer to 0 is better), before and after unlearning. Each math and language model result is averaged over three seeds. Math and language model types are described in Section \ref{sec:unlearning_under_more_types_of_models}. Pythia models are described in Section \ref{sec:prod_unlearn}.
  • ...and 17 more figures
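
The figure captions above score each method by the percent difference in memorization and in perplexity (or accuracy) before and after unlearning. The captions do not spell out the formula, so the sketch below assumes the usual relative change, 100 * (after - before) / before. The function name `percent_difference` and the numbers in the usage note are illustrative only and are not results from the paper.

```python
def percent_difference(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent.

    Assumed formula (not stated in the captions): 100 * (after - before) / before.
    """
    if before == 0:
        raise ValueError("percent difference is undefined when the 'before' value is 0")
    return 100.0 * (after - before) / before


def is_ideal_edit(mem_diff: float, quality_diff: float, tol: float = 1e-9) -> bool:
    """An ideal edit, per the captions: -100% memorization change, 0% quality change."""
    return abs(mem_diff + 100.0) <= tol and abs(quality_diff) <= tol
```

For example, with made-up values, a model whose memorized share drops from 40% to 0% has `percent_difference(40.0, 0.0) == -100.0`, and a perplexity that rises from 20.0 to 21.0 has `percent_difference(20.0, 21.0) == 5.0`. That edit removes all memorization but is not ideal, because perplexity changed.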

Theorems & Definitions (3)

  • Definition 2.1: Memorization
  • Definition 2.2: Noise artifact
  • Definition 2.3: Backdoor artifact