RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning Personal Information in Large Language Models

Bichen Wang; Yuzhe Zi; Yixin Sun; Yanyan Zhao; Bing Qin

TL;DR

This work tackles the challenge of removing personal information from large language models (LLMs) to comply with the Right to Be Forgotten (RTBF) and the GDPR. It introduces RKLD, a reverse KL-divergence-based knowledge distillation framework that constructs an unlearning teacher via continued training on the forget set and then uses a reverse KL loss to drive forgetting of the targeted information while preserving the distribution over the remaining tokens. Experiments on the TOFU benchmark show that RKLD achieves strong forget quality while maintaining model utility, outperforms existing baselines, and remains robust across different forget-set sizes. An ablation confirms that reverse KL outperforms forward KL for this kind of selective forgetting, and a case study shows why thorough unlearning is needed to prevent information leakage.
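As a rough illustration of the two ingredients named above, the sketch below works on a single next-token distribution over a toy four-token vocabulary. Figure 2 describes the teacher as subtracting the strengthened model's logits from the original model's; the specific form used here, `z_orig - alpha * max(0, z_strong - z_orig)`, together with the names `teacher_logits` and `reverse_kl_loss` and the value of `alpha`, are illustrative assumptions, not the paper's exact construction.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_logits(orig, strong, alpha=1.0):
    """Hypothetical teacher construction (an assumption, not the paper's formula).

    Continued training on the forget set raises the logits of the memorized
    answer; this sketch subtracts only that upward shift, scaled by alpha, so
    logits that continued training did not raise are left unchanged.
    """
    return [zo - alpha * max(0.0, zs - zo) for zo, zs in zip(orig, strong)]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl_loss(student, teacher):
    """Reverse KL, KL(student || teacher): the direction RKLD distills with."""
    return kl(softmax(student), softmax(teacher))


def forward_kl_loss(student, teacher):
    """Forward KL, KL(teacher || student): the direction compared in the ablation."""
    return kl(softmax(teacher), softmax(student))


# Toy 4-token vocabulary; token 0 stands for the private answer to be forgotten.
orig = [3.0, 1.0, 0.5, 0.0]    # original model's logits at one answer position
strong = [5.0, 1.0, 0.5, 0.0]  # after continued training, token 0 is boosted further
teacher = teacher_logits(orig, strong)

print(teacher)                            # [1.0, 1.0, 0.5, 0.0]
print(reverse_kl_loss(orig, teacher))     # positive: the student must move toward the teacher
print(reverse_kl_loss(teacher, teacher))  # 0.0 once the student matches the teacher
```

Only the logit of the forgotten token changes in the teacher, so distilling toward it leaves the relative ranking of the other tokens intact. On these logits `reverse_kl_loss(orig, teacher)` and `forward_kl_loss(orig, teacher)` differ, since KL divergence is asymmetric: reverse KL weights each token by the student's own probability. In an actual LLM the loss would be computed at every answer position from full-vocabulary logits and averaged, with gradients flowing only into the student.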

Abstract

With the passage of Right to Be Forgotten (RTBF) regulations and the growing scale of language model training datasets, research on unlearning in large language models (LLMs) has become increasingly important. Before the era of LLMs, machine unlearning research focused mainly on classification tasks with small models, where the content to be forgotten or retained is clear and well defined. As parameter counts have grown and tasks have become more complex, however, balancing forget quality against model utility has become harder, especially when the target is personal data rather than classification results. Existing methods based on gradient ascent and its variants often struggle with this balance, causing unintended information loss or only partial forgetting. To address this challenge, we propose RKLD, a novel Reverse KL-Divergence-based Knowledge Distillation unlearning algorithm for LLMs that targets the unlearning of personal information. In our experiments, RKLD achieves substantial forget quality while effectively maintaining model utility.
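To make the contrast with gradient ascent concrete, the toy sketch below applies gradient ascent to the negative log-likelihood of a single "gold" token, updating the logits directly. This is a deliberate simplification with hypothetical function names: a real LLM updates shared parameters, which is where effects on unrelated outputs can arise. The sketch shows only that the GA objective has no floor, so the loss grows at every step and nothing specifies what the remaining tokens' distribution should be.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, gold):
    """Negative log-likelihood of the gold (to-be-forgotten) token."""
    return -math.log(softmax(logits)[gold])


def gradient_ascent_step(logits, gold, lr=1.0):
    """One gradient-ascent step on the gold token's NLL, applied directly to
    the logits of a single position (toy: no shared model parameters).

    d NLL / d z_j = p_j - [j == gold], so ascent lowers the gold logit and
    raises every other logit in proportion to its current probability.
    """
    p = softmax(logits)
    return [z + lr * (p[j] - (1.0 if j == gold else 0.0)) for j, z in enumerate(logits)]


logits = [3.0, 1.0, 0.5, 0.0]  # token 0 is the private "gold" answer
history = []
for _ in range(20):
    history.append(nll(logits, gold=0))
    logits = gradient_ascent_step(logits, gold=0)

# The objective rises at every step: gradient ascent has no target
# distribution and no natural stopping point.
print([round(h, 2) for h in history[:3]], "...", round(history[-1], 2))
```

Because the NLL is convex in the logits, each ascent step strictly increases it, so the loss itself never signals when forgetting is complete. A distillation loss such as a KL divergence is bounded below by zero and is minimized exactly when the student matches its teacher.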

Paper Structure

This paper contains 22 sections, 6 equations, 3 figures, and 4 tables.

Introduction
Related Work
Machine Unlearning
Machine Unlearning for LLMs
RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning
Task Definition
Constructing Unlearning Teacher
Unlearning Distillation
Experiment Setups
TOFU Unlearning Benchmark
Forget Quality Metrics
Model Utility Metrics
Comparison Methods
Experiments
Main Result
...and 7 more sections

Figures (3)

Figure 1: After unlearning, we compare two unlearning algorithms, Gradient Ascent (GA) and RKLD, on QA pairs that require completing a person's occupation. GA can no longer complete the sentence, because gradient ascent only lowers the probability of the gold label and does nothing to protect the remaining tokens. RKLD, in contrast, maintains model utility by preserving the token distribution.
Figure 2: An illustration of the RKLD unlearning method. Unlike existing unlearning methods, RKLD constructs an unlearning teacher model through continued training on the forget set, which helps it selectively forget specific information while retaining overall model utility. The unlearning process has two steps: 1) Continued Training: the original model is further trained on the forget set to obtain a strengthened model, and the teacher is created by subtracting the strengthened model's logits from the original model's logits. 2) Unlearning Distillation: the teacher model guides the unlearning of the original model through a reverse KL divergence loss.
Figure 3: Forget quality versus model utility across different forget set sizes (1%, 5%, and 10% of the data). Each subfigure uses a dual scale: linear above the gray dotted line and logarithmic below it. Forget quality and model utility values are averaged over five seeds. Each point is plotted at the epoch where the method first reaches its peak forget quality within 10 epochs.
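The point-selection rule in the Figure 3 caption can be written out as a short sketch. The data layout (`fq_by_seed[s][e]`, one list of per-epoch scores per seed) and the choice to locate the peak on the seed-averaged curve rather than per seed are assumptions, since the caption does not say which; `select_plot_epoch` is a hypothetical name.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def select_plot_epoch(fq_by_seed, mu_by_seed, max_epochs=10):
    """Pick the epoch to plot for one method (hypothetical helper).

    fq_by_seed[s][e] / mu_by_seed[s][e]: forget quality / model utility of
    seed s after epoch e + 1. Scores are first averaged over seeds, then the
    first epoch (within max_epochs) at which the averaged forget quality hits
    its maximum is returned as (epoch, mean forget quality, mean model utility).
    """
    n_epochs = min(len(fq_by_seed[0]), max_epochs)
    fq_mean = [mean([run[e] for run in fq_by_seed]) for e in range(n_epochs)]
    mu_mean = [mean([run[e] for run in mu_by_seed]) for e in range(n_epochs)]
    first = fq_mean.index(max(fq_mean))  # list.index returns the first occurrence
    return first + 1, fq_mean[first], mu_mean[first]
```

Using `list.index` on the maximum implements "the first time" the peak is reached: if a later epoch ties the peak, the earlier epoch is still the one plotted.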
