Extracting Unlearned Information from LLMs with Activation Steering

Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann

TL;DR

This work proposes activation steering as a method for exact information retrieval from unlearned LLMs, and introduces a novel approach to generating steering vectors, named Anonymized Activation Steering, which successfully recovers general knowledge.

Abstract

An unintended consequence of the vast pretraining of Large Language Models (LLMs) is the verbatim memorization of fragments of their training data, which may contain sensitive or copyrighted information. In recent years, unlearning has emerged as a solution to effectively remove sensitive knowledge from models after training. Yet, recent work has shown that supposedly deleted information can still be extracted by malicious actors through various attacks. However, current attacks retrieve only sets of candidate generations and cannot pinpoint the output that contains the actual target information. We propose activation steering as a method for exact information retrieval from unlearned LLMs. We introduce a novel approach to generating steering vectors, named Anonymized Activation Steering. Additionally, we develop a simple word frequency method to pinpoint the correct answer among a set of candidates when retrieving unlearned information. Our evaluation across multiple unlearning techniques and datasets demonstrates that activation steering successfully recovers general knowledge (e.g., widely known fictional characters) while revealing limitations in retrieving specific information (e.g., details about non-public individuals). Overall, our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
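
The abstract mentions a simple word frequency method for picking the correct answer out of a set of candidate generations, but does not spell out its details. The following is only a minimal sketch of what such a method could look like, not the authors' implementation. The function names (`tokenize`, `rank_candidates`), the tiny stopword list, the choice to ignore words that already appear in the question, and the toy generations are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "of", "in", "and", "to", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def rank_candidates(question, generations):
    """Rank words by the number of sampled generations they appear in.

    Words from the question and stopwords are ignored (an assumption of
    this sketch), so that only potential answer keywords are counted.
    """
    ignored = STOPWORDS | set(tokenize(question))
    counts = Counter()
    for generation in generations:
        # Count each word once per generation so that a single long
        # sample cannot dominate the ranking.
        counts.update(set(tokenize(generation)) - ignored)
    return counts.most_common()


# Toy data, invented for this example.
question = "What sport is played at the school?"
generations = [
    "The sport is Quidditch.",
    "It is Quidditch, played on brooms.",
    "The school plays football.",
    "Quidditch is the sport.",
]

ranking = rank_candidates(question, generations)
print(ranking[0])
```

In this sketch the most frequent keyword across samples is treated as the top candidate answer. A real evaluation would presumably need a more careful tokenizer and a principled way to handle multi-word answers.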

Paper Structure

This paper contains 11 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: An example anonymization of a question.
  • Figure 2: Visual representation for AnonAct.
  • Figure 3: Experiment results for the Harry Potter dataset. (a) compares the CAFs obtained by sampling from the unlearned model without and with AnonAct. Questions on the x-axis are sorted in ascending order by the difference in CAF. (b) shows the ROC curves for the base model, the unlearned model, and the unlearned model with AnonAct, using keyword frequencies as scores.
  • Figure 4: The CAFs from sampling without and with AnonAct for the TOFU experiment. Questions are sorted in ascending order by the increase in performance.
  • Figure 5: Next-token probabilities for the sentence completions, comparing the unlearned model without and with AnonAct. Blue is the original answer, and red is the replaced one.