Intrinsic Test of Unlearning Using Parametric Knowledge Traces
Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, Mor Geva
TL;DR
The paper argues that unlearning in large language models should be evaluated not only through behavioral tests but also by inspecting the parametric knowledge traces left in the model's weights. It introduces ConceptVectors, a benchmark that uses vocabulary projections to localize "concept vectors" (parameter vectors that encode concrete concepts) for hundreds of common concepts in two open-source LLMs, and pairs these intrinsic measurements with behavioral tests. The authors show that existing unlearning methods barely change the concept vectors and mostly suppress them at inference time, so the residual knowledge can be recovered adversarially, for example through jailbreak attacks. Directly ablating the concept vectors, by contrast, removes the associated knowledge and significantly reduces the model's susceptibility to such attacks. The authors conclude that parameter-based evaluation should complement behavioral evaluation, and they release their code and benchmark to support future work.
Abstract
The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that encode concrete concepts - and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
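To make the idea of a vocabulary projection more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy unembedding matrix with one row per vocabulary token, scores a parameter vector against every row by dot product, and reads the highest-scoring tokens as the concept the vector encodes. The vocabulary, matrix values, vectors, and the use of Jaccard overlap between top-token sets as a "trace" measure are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Set


def top_tokens(vector: Sequence[float],
               unembedding: Sequence[Sequence[float]],
               vocab: Sequence[str],
               k: int = 3) -> Set[str]:
    """Project `vector` onto the vocabulary and return the k best-scoring tokens."""
    # One score per token: dot product of its unembedding row with the vector.
    scores = [sum(w * x for w, x in zip(row, vector)) for row in unembedding]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return {vocab[i] for i in ranked[:k]}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two token sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical toy vocabulary and unembedding matrix (5 tokens, hidden size 3).
VOCAB: List[str] = ["wizard", "wand", "spell", "tax", "invoice"]
UNEMBEDDING: List[List[float]] = [
    [0.9, 0.1, 0.0],  # wizard
    [0.8, 0.2, 0.1],  # wand
    [0.7, 0.0, 0.2],  # spell
    [0.0, 0.9, 0.1],  # tax
    [0.1, 0.8, 0.0],  # invoice
]

# Hypothetical concept vector before any intervention.
concept_vector = [1.0, 0.0, 0.0]
# Stand-in for the same vector after an unlearning method that barely moves it.
after_unlearning = [0.95, 0.05, 0.0]
# Stand-in for the same vector after it has been directly overwritten (ablated).
after_ablation = [0.0, 1.0, 0.0]

before = top_tokens(concept_vector, UNEMBEDDING, VOCAB)
trace_after_unlearning = jaccard(before, top_tokens(after_unlearning, UNEMBEDDING, VOCAB))
trace_after_ablation = jaccard(before, top_tokens(after_ablation, UNEMBEDDING, VOCAB))

print(sorted(before))
print(trace_after_unlearning, trace_after_ablation)
```

In this toy setup the lightly perturbed vector still projects to the same top tokens, while the overwritten vector mostly does not. This mirrors, in miniature, the contrast the abstract draws between methods that leave concept vectors nearly intact and direct ablation.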
