Intrinsic Test of Unlearning Using Parametric Knowledge Traces

Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, Mor Geva

TL;DR

The paper addresses the problem of evaluating unlearning in large language models beyond surface behavior by examining internal parametric knowledge traces. It introduces ConceptVectors, a benchmark that uses vocabulary projections to localize concept vectors, the parameter vectors that encode specific concepts, and pairs this parametric evaluation with behavioral tests. The authors show that existing unlearning methods mostly suppress concept vectors during inference while leaving the parametric traces largely intact, so jailbreak attacks can reactivate the supposedly erased knowledge. Directly ablating the concept vectors, by contrast, removes the associated knowledge and reduces susceptibility to jailbreaks. They argue that parameter-based evaluation should complement behavioral evaluation, and they release their code and benchmark for future work.

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that encode concrete concepts - and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
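
To make the idea of a vocabulary projection concrete, here is a minimal sketch rather than the authors' implementation. It assumes a toy six-token vocabulary, a hand-written unembedding matrix, and a hypothetical helper named `project_to_vocab`; none of these come from the paper or its released code. The projection scores every vocabulary token against a parameter vector and reads off the highest-scoring tokens, which is how a vector can be inspected for the concept it encodes.

```python
import numpy as np

def project_to_vocab(vector, unembedding, vocab, k=3):
    """Score each vocabulary token against `vector` and return the top k.

    `unembedding` has one row per token (shape: vocab_size x hidden_dim),
    so the matrix-vector product gives one score per token.
    """
    scores = unembedding @ vector
    top = np.argsort(-scores)[:k]  # indices of the k largest scores
    return [(vocab[i], float(scores[i])) for i in top]

# Toy data (assumed for illustration, not taken from any real model).
vocab = ["wizard", "wand", "spell", "tax", "invoice", "river"]
unembedding = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
])

# A candidate parameter vector that points mostly along the first direction.
candidate = np.array([1.0, 0.0, 0.1])

top_tokens = project_to_vocab(candidate, unembedding, vocab)
for token, score in top_tokens:
    print(f"{token}: {score:.2f}")
```

In this toy setup the top tokens all relate to one theme, which is the kind of signal that would mark a vector as a candidate concept vector. With a real model, the unembedding matrix and the candidate vectors would be read from the model's weights instead.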

Paper Structure (45 sections, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Illustration of our key contributions: (a) we create a benchmark for evaluating the ability of unlearning methods to erase parametric knowledge, (b) we show that existing unlearning methods suppress the usage of parametric knowledge without erasing it, but (c) the residual knowledge can be unsuppressed with jailbreaking, and (d) ablating this knowledge is important for robust unlearning.
  • Figure 2: Illustration of our methodology for generating parametric and behavioural evaluations for unlearning: (1) We localize parametric concept vectors using vocabulary projections, (2) for every identified concept, we use GPT-4 to generate simple questions about the concept and obtain the model's answers before unlearning, (3) we validate that the identified concepts exhibit causal effects on the model's outputs about the concept but not on other concepts.
  • Figure 3: Jailbreak results for LLaMA and OLMo on the selected 10 concepts.
  • Figure 4: Concept Validation Experiments Results for LLaMA2-7B-chat and OLMo-7B. The first two plots show the average BLEU and Rouge-L scores across the entire ConceptVectors dataset for LLaMA and OLMo before and after disrupting the corresponding concept vectors with Gaussian noise. The latter two plots display the specific distribution of BLEU scores for target QA and unrelated knowledge QA after experiments on both models.
  • Figure 5: Jailbreak results for LLaMA (left) and OLMo (right) using Rouge-L score as the metric.
  • ...and 5 more figures
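
The captions above mention two interventions on a concept vector: disrupting it with Gaussian noise (Figure 4) and ablating it (Figure 1d). The sketch below illustrates both on a toy weight matrix. It rests on several assumptions that are not from the paper: candidate vectors are stored as matrix columns, ablation means zeroing the column, the noise scale is arbitrary, and cosine similarity stands in as a simple measure of how much of the original vector remains. The function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(weights, column, scale, rng):
    """Return a copy of `weights` with Gaussian noise added to one column."""
    noisy = weights.copy()
    noisy[:, column] += rng.normal(0.0, scale, size=weights.shape[0])
    return noisy

def ablate(weights, column):
    """Return a copy of `weights` with one column set to zero (assumed ablation)."""
    ablated = weights.copy()
    ablated[:, column] = 0.0
    return ablated

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 4))  # toy matrix: 4 candidate vectors of dim 8
concept_col = 2                    # hypothetical index of the concept vector

noisy = add_gaussian_noise(weights, concept_col, scale=5.0, rng=rng)
ablated = ablate(weights, concept_col)

original = weights[:, concept_col]
print("similarity after noise:   ", cosine(original, noisy[:, concept_col]))
print("similarity after ablation:", cosine(original, ablated[:, concept_col]))
```

Both interventions touch only the chosen column, which mirrors the validation goal stated in Figure 2: the edit should affect outputs about the target concept but not other concepts. Confirming that on a real model needs the behavioural tests as well, since this sketch only checks the parameters.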