Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective

Yujian Liu; Yang Zhang; Tommi Jaakkola; Shiyu Chang

TL;DR

This work formalizes targeted unlearning for LLMs by modeling the knowledge about an unlearning target as a confounder in a causal model and deriving a training objective based on deconfounding. The resulting causal intervention framework justifies and extends Who's Harry Potter (WHP), yielding a WHP-like algorithm augmented with counterfactual prompting and the aggregation of multiple counterfactual teacher distributions. Experiments on a new benchmark, Wikipedia Person Unlearning (WPU), and on the TOFU setting show that the proposed method achieves competitive forgetting efficacy, preserves unrelated utility, reduces hallucinations, and remains robust under adversarial jailbreaks, all without requiring retain data. The approach provides principled design choices for targeted unlearning and offers practical insights for deploying safer, more privacy-preserving LLMs. Code is available at https://github.com/UCSB-NLP-Chang/causal_unlearn.git.

Abstract

This paper investigates Who's Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We explore it in two steps. First, we introduce a new task of LLM targeted unlearning, where given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, where the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process as a deconfounding process. This framework justifies and extends WHP, deriving a simple unlearning algorithm that includes WHP as a special case. Experiments on existing and new datasets show that our approach, without explicitly optimizing for the aforementioned criteria, achieves competitive performance on all of them. Our code is available at https://github.com/UCSB-NLP-Chang/causal_unlearn.git.
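
To make the teacher-student idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that several teacher next-token distributions (one per hypothetical counterfactual replacement of the target) are combined by a uniform average, and that the student is scored against the aggregate with a KL divergence. The aggregation rule, the direction of the divergence, the toy vocabulary, and the names aggregate_teachers and kl_divergence are all illustrative assumptions; the released code should be consulted for the actual objective.

```python
import math


def aggregate_teachers(teacher_dists):
    """Uniformly average several next-token distributions.

    Assumption: a plain mean is used here purely for illustration.
    """
    n = len(teacher_dists)
    vocab = len(teacher_dists[0])
    return [sum(d[i] for d in teacher_dists) / n for i in range(vocab)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy 3-token vocabulary. Each row is a hypothetical teacher distribution
# obtained under a different counterfactual replacement of the target.
teachers = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
teacher = aggregate_teachers(teachers)

# Hypothetical student distribution for the same position.
student = [0.2, 0.2, 0.6]

# Distillation-style loss: zero only when the student matches the teacher.
loss = kl_divergence(teacher, student)
```

In a real training loop this per-token loss would be computed from model logits and minimized with respect to the student's parameters; the pure-Python version above only illustrates the arithmetic.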

Paper Structure (34 sections, 5 equations, 20 figures, 10 tables)

Introduction
Related Works
Methodology
Problem Formulation
Review of Who is Harry Potter
A Causal Intervention Framework for Targeted Unlearning
Deriving the Teacher Distribution
Training a Student LLM
Connection to Who is Harry Potter
Summary
Experiments
Dataset Construction
Forgetting Persons
Forgetting Authors and Books
Ablation Study
...and 19 more sections

Figures (20)

Figure 1: Illustration of Who's Harry Potter unlearning.
Figure 2: An example of the targeted unlearning task and desired responses. Knowledge to be forgotten (or retained) is highlighted in red (blue).
Figure 3: Causal graph for the data generation process.
Figure 4: Performance of each criterion (normalized by maximum) on WPU. Higher is better for all metrics.
Figure 5: Forget Quality ($\uparrow$) vs. Model Utility ($\uparrow$) on TOFU (average of $3$ seeds). For clarity, values above $0.1$ are in linear scale, and those below $0.1$ are in log scale.
...and 15 more figures
