Table of Contents
Fetching ...

Dissecting Fine-Tuning Unlearning in Large Language Models

Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, Haiqin Yang

TL;DR

The limitations of fine-tuning-based unlearning are explored through activation patching and parameter restoration experiments, revealing that these methods alter the model’s knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters.

Abstract

Fine-tuning-based unlearning methods prevail for preventing targeted harmful, sensitive, or copyrighted information within large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this work, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, providing further evidence that they do not genuinely erase the problematic knowledge embedded in the model parameters. Instead, the coefficients generated by the MLP components in the model's final layer are the primary contributors to these seemingly positive unlearning effects, playing a crucial role in controlling the model's behaviors. Furthermore, behavioral tests demonstrate that this unlearning mechanism inevitably impacts the global behavior of the models, affecting unrelated knowledge or capabilities. The code is released at https://github.com/yihuaihong/Dissecting-FT-Unlearning.

Dissecting Fine-Tuning Unlearning in Large Language Models

TL;DR

The limitations of fine-tuning-based unlearning are explored through activation patching and parameter restoration experiments, revealing that these methods alter the model’s knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters.

Abstract

Fine-tuning-based unlearning methods prevail for preventing targeted harmful, sensitive, or copyrighted information within large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this work, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, providing further evidence that they do not genuinely erase the problematic knowledge embedded in the model parameters. Instead, the coefficients generated by the MLP components in the model's final layer are the primary contributors to these seemingly positive unlearning effects, playing a crucial role in controlling the model's behaviors. Furthermore, behavioral tests demonstrate that this unlearning mechanism inevitably impacts the global behavior of the models, affecting unrelated knowledge or capabilities. The code is released at https://github.com/yihuaihong/Dissecting-FT-Unlearning.

Paper Structure

This paper contains 13 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Results of KRS on LLaMA and OLMo under three activations patching or parameters restoration settings. We also included another setting that restores both attention and coefficients to compare the final outcomes.
  • Figure 2: Unlearning testing results on LLaMA and OLMo for each training epoch.
  • Figure 3: Results of KRS on LLaMA and OLMo under three activations patching or parameters restoration settings, isolating the effects of the two others when investigating each factor individually.