Learning to Refuse: Towards Mitigating Privacy Risks in LLMs

Zhenhua Liu; Tong Zhu; Chuanyuan Tan; Wenliang Chen

TL;DR

This paper addresses privacy risks arising from LLM memorization by introducing RETURN, a real-world dataset of 2,492 individuals with 20 QA pairs each, for evaluating machine unlearning. It then presents the Name-Aware Unlearning Framework (NAUF), which combines Name-Aware Refusal Answer and Contrastive Data Augmentation to protect targeted individuals while preserving performance on others. NAUF achieves a state-of-the-art average unlearning score, outperforming the best baseline by 5.65 points, and maintains downstream task accuracy, demonstrating practical privacy protection without complete retraining. The work highlights a viable path toward LLMs that comply with the right to be forgotten (RTBF) and outlines future work on scaling and fine-grained privacy protections in real-world settings.

Abstract

Large language models (LLMs) exhibit remarkable capabilities in understanding and generating natural language. However, these models can inadvertently memorize private information, posing significant privacy risks. This study addresses the challenge of enabling LLMs to protect specific individuals' private data without the need for complete retraining. We propose RETURN, a Real-world pErsonal daTa UnleaRNing dataset, comprising 2,492 individuals from Wikipedia with associated QA pairs, to evaluate machine unlearning (MU) methods for protecting personal data in a realistic scenario. Additionally, we introduce the Name-Aware Unlearning Framework (NAUF) for Privacy Protection, which enables the model to learn which individuals' information should be protected without affecting its ability to answer questions related to other unrelated individuals. Our extensive experiments demonstrate that NAUF achieves a state-of-the-art average unlearning score, surpassing the best baseline method by 5.65 points, effectively protecting target individuals' personal data while maintaining the model's general capabilities.
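The name-aware idea described above can be illustrated with a minimal sketch of how training targets might be assembled: questions about individuals in the forget set are paired with a refusal that mentions the individual's name, while questions about everyone else keep their original answers. This is only an illustration under stated assumptions, not the authors' implementation. The refusal wording, the `QAPair` fields, the `build_targets` helper, and the example names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical refusal wording; the paper's actual templates are not given here.
REFUSAL_TEMPLATE = "I'm sorry, but I can't share information about {name}."


@dataclass(frozen=True)
class QAPair:
    name: str      # individual the question is about
    question: str
    answer: str    # original answer to keep for retained individuals


def build_targets(pairs, forget_names):
    """Pair each question with a refusal (forget set) or its original answer."""
    targets = []
    for pair in pairs:
        if pair.name in forget_names:
            # Protected individual: the target is a refusal that names them.
            target = REFUSAL_TEMPLATE.format(name=pair.name)
        else:
            # Unrelated individual: the model should still answer normally.
            target = pair.answer
        targets.append((pair.question, target))
    return targets


# Made-up individuals, used only to exercise the helper.
pairs = [
    QAPair("Alice Example", "Where was Alice Example born?", "Springfield"),
    QAPair("Bob Sample", "What is Bob Sample's profession?", "Composer"),
]
targets = build_targets(pairs, forget_names={"Alice Example"})
```

In this sketch only the forget-set question receives a refusal target, which mirrors the stated goal of protecting selected individuals without degrading answers about others.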

Paper Structure (41 sections, 8 equations, 8 figures, 7 tables)

Introduction
RETURN: Real-world pErsonal daTa UnleaRNing
Data Construction
Identifying Individuals with Deep Memorization
Evaluation Setup
Evaluation Metrics
Forget Score
Retain Score
Downstream Task Accuracy
Name-Aware Unlearning Framework
Name-Aware Refusal Answer
Contrastive Data Augmentation
Experiments
Baseline Methods
Unlearning on Forget Set
...and 26 more sections

Figures (8)

Figure 1: An example of extracting private information from LLMs. When an individual exercises the right to be forgotten (RTBF), the model should protect their private information.
Figure 2: The construction of RETURN and the process for evaluating Machine Unlearning (MU) methods using this dataset.
Figure 3: Accuracy distribution of LLaMA-3 on RETURN.
Figure 4: An example of Contrastive Data Augmentation (CDA) for an individual in the forget set, with Darrell Hammond as the target individual.
Figure 5: An example of Contrastive Data Augmentation (CDA) for an individual in the retain set, with Brian Eno as the target individual.
...and 3 more figures
